AI agents are emerging as a promising research direction with real-world applications. These agents use foundation models such as large language models (LLMs) and vision-language models (VLMs) to take natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use various tools such as browsers, search engines, and code compilers to verify their actions and reason about their goals.
However, a recent analysis by researchers at Princeton University revealed several gaps in current agent benchmarks and evaluation practices that hamper their usefulness in real-world applications.
Their findings highlight that comparative evaluation of agents comes with unique challenges and that we cannot evaluate agents in the same way that we evaluate foundation models.
Trade-off between cost and accuracy
One of the main issues the researchers highlighted in their study is the lack of cost control in agent evaluations. AI agents can be much more expensive to run than a single model call because they often rely on stochastic language models that can produce different results when asked the same query multiple times.
To increase accuracy, some agent systems generate multiple answers and use mechanisms such as voting or external verification tools to choose the best answer. Sometimes, sampling hundreds or thousands of answers can increase the agent’s accuracy. While this approach can improve performance, it has a significant computational cost. Inference costs are not always an issue in research settings, where the goal is to maximize accuracy.
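As a rough sketch of that sampling-and-voting pattern (the call_model wrapper and the number of samples below are hypothetical placeholders, not the setups compared in the Princeton study):

```python
from collections import Counter

def answer_with_voting(call_model, question, n_samples=20):
    """Sample a stochastic model several times and return the majority answer.

    `call_model` stands in for an LLM API call with temperature > 0, so
    repeated calls can return different answers. Every extra sample tends to
    improve robustness but adds another model call to the inference bill.
    """
    answers = [call_model(question) for _ in range(n_samples)]
    best_answer, votes = Counter(answers).most_common(1)[0]
    return best_answer, votes / n_samples  # answer plus its vote share
```

Each additional sample is another model call, which is exactly the kind of variable inference cost the researchers argue should be reported alongside accuracy.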
However, in practical applications, the budget available for each query is limited, making it crucial to control the costs of agent evaluations. Otherwise, researchers might have to develop extremely expensive agents just to get to the top of the rankings. The Princeton researchers propose to visualize the evaluation results as a Pareto curve of accuracy and inference cost and to use techniques that jointly optimize the agent for both measures.
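A minimal sketch of how such a Pareto view can be computed from evaluation results; the agent names and numbers below are invented for illustration, not figures from the study:

```python
def pareto_frontier(results):
    """Return the agents that are not dominated on (cost, accuracy).

    `results` maps an agent name to a (cost_per_query_usd, accuracy) pair.
    An agent is dominated if some other agent is at least as good on both
    axes and strictly better on at least one.
    """
    frontier = []
    for name, (cost, acc) in results.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in results.items() if other != name
        )
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda item: item[1])

# Illustrative numbers only.
results = {
    "single_call":    (0.01, 0.55),
    "voting_x10":     (0.10, 0.62),
    "voting_x100":    (1.00, 0.63),
    "verifier_agent": (0.15, 0.64),
}
for name, cost, acc in pareto_frontier(results):
    print(f"{name}: ${cost:.2f}/query, accuracy {acc:.2f}")
```

Agents that fall off this frontier, such as the hypothetical voting_x100 above, spend far more per query without buying any accuracy over a cheaper alternative.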
The researchers evaluated the trade-offs between accuracy and cost of different prompting techniques and agent architectures introduced in different papers.
“For roughly similar accuracy, the cost can vary by nearly two orders of magnitude,” the researchers write. “Yet the cost of running these agents is not a prominent metric reported in any of these papers.”
The researchers argue that optimizing both metrics can lead to “agents that cost less while maintaining their accuracy.” Joint optimization also lets researchers and developers trade off the fixed and variable costs of running an agent. For example, they can spend more on optimizing the agent’s design up front but reduce the variable cost by using fewer in-context learning examples in the agent’s prompt.
The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark. Their results show that the joint optimization formulation finds an optimal balance between accuracy and inference costs.
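One simple way to make that joint formulation concrete is to score each candidate configuration with a single objective that penalizes total cost; the penalty weight, configurations, and numbers below are illustrative assumptions, not the method used in the paper:

```python
def joint_score(accuracy, cost_per_query, expected_queries, fixed_cost, lam=0.001):
    """Accuracy minus a cost penalty (lam = accuracy points given up per dollar).

    The fixed cost covers one-time design and optimization spend, while the
    variable cost scales with how many queries the deployment expects to serve.
    """
    total_cost = fixed_cost + cost_per_query * expected_queries
    return accuracy - lam * total_cost

# Hypothetical configurations: a longer prompt with more in-context examples
# raises the per-query cost, while extra offline tuning raises the fixed cost.
configs = {
    "8-shot prompt, no tuning": dict(accuracy=0.60, cost_per_query=0.020,
                                     expected_queries=10_000, fixed_cost=0),
    "2-shot prompt, tuned":     dict(accuracy=0.58, cost_per_query=0.005,
                                     expected_queries=10_000, fixed_cost=50),
}
best = max(configs, key=lambda name: joint_score(**configs[name]))
print("Preferred configuration:", best)  # the cheaper variant wins here
```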
“Evaluations of useful agents must take cost into account, even if we ultimately do not care about cost but only about identifying innovative agent designs,” the researchers write. “Accuracy alone cannot identify progress, because it can be improved by scientifically meaningless methods such as retrying.”
Model development vs downstream applications
The researchers also highlight the difference between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the priority and inference costs are largely ignored. However, when developing real-world applications on top of AI agents, inference costs play a crucial role in deciding which model and technique to use.
Evaluating inference costs for AI agents is challenging. For example, different model vendors may charge different amounts for the same model. At the same time, API call costs change regularly and can vary based on developer decisions. For example, on some platforms, bulk API calls are charged differently.
To address this problem, the researchers created a website that adjusts model comparisons based on token prices.
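A sketch of what such an adjustment involves; the per-token prices below are placeholders, since real prices differ by vendor and change over time:

```python
# Placeholder prices in USD per one million tokens. Real prices differ by
# vendor and change frequently, which is why comparisons must be adjustable.
PRICES = {
    "model_a": {"input": 5.00, "output": 15.00},
    "model_b": {"input": 0.50, "output": 1.50},
}

def query_cost(model, input_tokens, output_tokens, prices=PRICES):
    """Dollar cost of one query under the current per-million-token prices."""
    p = prices[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Updating the price table re-ranks agents by cost without re-running a
# single benchmark query.
print(query_cost("model_a", input_tokens=3_000, output_tokens=500))  # 0.0225
print(query_cost("model_b", input_tokens=3_000, output_tokens=500))  # 0.00225
```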
They also conducted a case study on NovelQA, a benchmark for question-answering tasks on very long texts. They found that benchmarks intended for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA study makes retrieval-augmented generation (RAG) look significantly worse relative to long-context models than it would in a real-world scenario. Their findings show that RAG and long-context models were about equally accurate, while the long-context models were 20 times more expensive.
Overfitting is a problem
When learning new tasks, machine learning (ML) models often find shortcuts that allow them to perform well on benchmarks. One major type of shortcut is “overfitting,” where the model finds ways to game the benchmark and delivers results that don’t translate to the real world. The researchers found that overfitting is a serious problem for agent benchmarks, because they tend to be small, typically consisting of only a few hundred samples. The problem is more serious than data contamination in training foundation models, because knowledge of the test samples can be directly programmed into the agent.
To address this problem, the researchers suggest that benchmark developers create and maintain holdout test sets composed of examples that cannot be memorized during training and can only be solved through a genuine understanding of the target task. In their analysis of 17 benchmarks, the researchers found that many lacked proper holdout datasets, allowing agents to take shortcuts, even unintentionally.
“Surprisingly, we find that many agent benchmarks do not include held-out test sets,” the researchers write. “In addition to creating a test set, benchmark developers should consider keeping it secret to prevent LLM contamination or agent overfitting.”
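A minimal sketch of what maintaining such a hold-out could look like for a benchmark developer; the file names, split ratio, and hashing scheme are arbitrary choices, not a recipe from the paper:

```python
import hashlib
import json

def split_benchmark(examples, holdout_fraction=0.3):
    """Deterministically split examples into a public set and a private hold-out.

    Hashing each example's id keeps the assignment stable across releases,
    so hold-out examples never drift into the public set by accident.
    """
    public, holdout = [], []
    for ex in examples:
        bucket = int(hashlib.sha256(ex["id"].encode()).hexdigest(), 16) % 100
        (holdout if bucket < holdout_fraction * 100 else public).append(ex)
    return public, holdout

examples = [{"id": f"task-{i}", "question": "..."} for i in range(500)]
public, holdout = split_benchmark(examples)
with open("benchmark_public.json", "w") as f:
    json.dump(public, f)
# The hold-out file is kept off the public release to limit contamination.
with open("benchmark_holdout_private.json", "w") as f:
    json.dump(holdout, f)
```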
They also note that different types of holdout samples are needed depending on the desired level of generality of the task the agent performs.
“Benchmark developers should do their best to ensure that shortcuts are impossible,” the researchers write. “We consider this the responsibility of benchmark developers rather than agent developers, because designing benchmarks that do not allow shortcuts is much simpler than checking each agent to see if it takes shortcuts.”
The researchers tested WebArena, a benchmark that evaluates the performance of AI agents in solving problems on different websites. They found several shortcuts in the training datasets that allowed the agents to overfit to tasks in ways that would easily break with minor changes in the real world. For example, an agent could make assumptions about the structure of a web address without considering that it might change in the future or that it would not hold on other websites.
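As a hypothetical illustration of that kind of shortcut (the site, functions, and browser interface below are invented, not code from WebArena or from the study):

```python
# Brittle shortcut: the agent hard-codes how it thinks the site builds URLs.
# It passes the benchmark task, but breaks the moment the site changes its
# routing or the same task is attempted on a different website.
def find_order_page_brittle(order_id):
    return f"https://shop.example.com/orders?id={order_id}"

# More robust behavior: start from a page the site actually serves and follow
# the link that mentions the order. `browser` stands in for whatever browsing
# tool the agent uses; its methods here are hypothetical.
def find_order_page_robust(browser, order_id):
    browser.goto("https://shop.example.com/account")
    link = browser.find_link(text_contains=str(order_id))
    return link.href
```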
These errors inflate accuracy estimates and lead to excessive optimism about the agents’ abilities, the researchers warn.
Since AI agents are a new field, the research and development communities still have a lot to learn about how to test the limits of these new systems that could soon become an important part of everyday applications.
“Benchmarking AI agents is new and best practices have not yet been established, making it difficult to distinguish true advances from hype,” the researchers write. “Our thesis is that agents are sufficiently different from models that benchmarking practices need to be rethought.”