How we built Parloa's model evaluation system
The language model market moves at a rapid pace. In just one week, Azure could announce a deprecation, a new open-source model could top the leaderboard, while another provider changes its pricing. Each of these events presents the same questions to Parloa: Is this change material to how Parloa should operate, and how quickly can we find out?
For a long time, it felt like we could never find out quickly enough. Evaluating a new model against real Parloa scenarios meant weeks of setup work, manual comparisons, and results that were hard to reproduce when someone asked how we got there. The process was slow because it was ad hoc every time. We were consistently running into two main bottlenecks:
Scalability: When a new model drops, you might want to evaluate it against five different scenarios, across three performance dimensions, and compare it to two existing baselines. Those are thirty data points that each require setting up the right environment, running the right inputs, and capturing the right outputs consistently.
Reproducibility: If we decide today that Model A outperforms Model B on instruction-following in an insurance scenario, that result needs to hold up six months later when someone asks us to justify the decision. Without a persistent, versioned record of exactly what we ran and how, that confidence degrades quickly.
Knowing we needed to not just to evaluate models faster, but make model evaluation a reliable, repeatable process that any applied scientist or product manager (PM) could trust and act on, we built AMP Dojo System (Dojo).
How Dojo works
The core abstraction in Dojo is to separate what stays fixed from what gets changed. Everything else in the design follows from that choice.
A test case is a controlled environment that defines a specific scenario, such as an insurance claim, multilingual booking flow, or a customer attempting to reach billing before completing authentication. The agent configuration, mock tools it has access to, simulated caller, and conversational inputs are constructed once and reused across many evaluations.
The component under test is the variable element, most commonly an LLM, but extensible to any part of the stack: a system prompt template, a skill executor, a routing configuration. The rest of the system, the system harness, remains constant so that any difference in output can be attributed to the component being swapped.
A test job runs a component under test through a test case and produces conversational transcripts as the raw artifacts of the evaluation. The transcripts are then assessed against evaluation criteria: explicit, pre-defined pass/fail conditions mapped to specific phenomena. Did the agent ask for the order number before proceeding? Did it follow the instructions to stay within scope? Did it trigger the tool call at the right moment?
Important to note is that these criteria are not part of the test case and can be applied to pre-existing transcripts, meaning a new evaluation question can be answered without rerunning the simulation from scratch.
The result of aggregating those assessments across all conversations in a run creates the benchmark: a structured, comparable view of how a given component performs across the scenarios that matter to us.
Early results
With Dojo, evaluations that took weeks can now be completed while we sleep, operating from run start to results within an 8-12 hour window.
The speed gets the headline, but what the speed changes is really what matters:
When evaluation is slow and manual, model questions get rationed. You don’t run a benchmark to satisfy a hunch because the cost of finding out requires too much from your human team. As a result, you make fewer model decisions, with more guesswork in each one.
When Dojo takes on all of this work, more questions get asked and decisions get made on evidence. Automating the grunt work did not just make us faster. It made the organization's model intelligence less expensive to produce.
As an example, when GPT-4o and 4.1 were slated for retirement, we had a forced migration on our hands. With Dojo, the decision of what we migrated to rested on actual performance data across our own scenarios.
The design choices that held
A few important decisions shaped how Dojo behaves:
Reproducibility as a first-class requirement. Every test job persists its full configuration alongside its results, meaning the benchmark you ran six months ago can be rerun with the same configuration. This matters for auditability. A PM asking why we recommended a model migration deserves an answer that can be traced back to a specific run, not a subjective summary.
Separating the harness from the component under test. This sounds obvious in retrospect but required discipline to implement. While the temptation was to couple the scenario and the model configuration, this structure would make reuse harder. By separating the harness and the model, the same test case can be run against any component, which is what makes the system modular and effective, not just fast.
Evaluation criteria as a post-processing step. Decoupling the criteria from the simulation means we can ask new evaluation questions about runs that have already happened. If a new phenomenon becomes relevant—maybe measuring tool calling laziness across all historical runs as an example—we can apply that criterion retroactively without re-simulating anything. This has proven more valuable than we even originally anticipated.
Three different consumers, one system. Applied Scientists define and run benchmarks. Engineers integrate new models and maintain the framework. PMs consume the results to make decisions. Building one system that serves all three without compromising any of them was central to our design.
What we learned
When we started benchmarking models, we thought most of the differences in our results would come down to the models themselves, but that wasn’t the case at all. The biggest errors came from our own simulation harness. At one point, for example, the tool that formatted conversation transcripts was tagging tool responses and system messages as if they were from the agent, which ended up giving us false failures in our evaluation. Until we noticed and fixed it, we were sometimes ranking models based more on weirdness in our setup than on what the models were actually doing.
Another surprise was that every time we made the simulation a little more accurate, it messed with results we’d already recorded. If we changed things like having the agent speak first, matching the customer’s reply style, or only letting one tool call per turn, the numbers we’d been tracking would all move around. That posed a new challenge because while it's easy to reproduce results if you never change the test setup, actually getting things right means tweaking as you go. We’re still figuring out how to balance keeping things steady with making improvements.
Lastly, the evaluation cost we report isn’t just about the model itself. It also depends on how the simulation is set up. The way we get a model to show its reasoning changes how many tokens we use, for example. If we ask a model to spell out its thinking in a separate box versus using its built-in reasoning, the billing is different. We have to track both. You could run the same model on the same scenario with two different configs and end up with two totally different costs.
A tradeoff we would reconsider
Setting up a local test path that didn’t rely on our production platform let us keep working even when production integration posed challenges. As time went on, however, we spent more and more time copying all the little details from production into our local test setup—things like how messages are structured, who talks first, how conversations end, and how tool calls work. The closer we tried to match production, the more it felt like we were just building another version of the system. If we could do it over, we’d figure out from the start which questions really need the full production system and which ones could use the local path, instead of letting the local environment grow every time a new problem was introduced.
What’s next
As Dojo accumulates data based on how different models perform across different task types, it becomes the empirical foundation for routing decisions: knowing that a cheaper model handles summarization reliably, while a more capable one earns its cost on nuanced multi-turn reasoning, for example.
LLM evaluation is Milestone 1 of Dojo. Milestone 2 will extend benchmarking to any component of AMP, channels, skills, and executors, so the same rigor we apply to model selection will apply to any product decision. Milestone 3 will open the system to our Agent Architects, so evaluation is not bottlenecked with Applied Science.


:format(webp))
:format(webp))
:format(webp))