This year should be the year generative artificial intelligence (GenAI) takes off in the enterprise, according to many observers. One way to get there is retrieval-augmented generation (RAG), an approach in which a large AI language model is linked to a database of domain-specific content, such as a company's data files.
However, RAG is an emerging technology with pitfalls of its own.
That’s why researchers at Amazon AWS propose, in a new paper, a series of benchmarks that specifically test how well RAG can answer questions about domain-specific content.
“Our method is an automated, cost-effective, interpretable and robust strategy for selecting the optimal components for a RAG system,” write lead author Gauthier Guinet and his team in the paper “Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation,” published on the arXiv preprint server.
The paper is being presented at the 41st International Conference on Machine Learning, an AI conference taking place July 21-27 in Vienna.
The fundamental problem, Guinet and his team explain, is that while there are many benchmarks for comparing the ability of various large language models (LLMs) across many tasks, in the field of RAG in particular there is no “canonical” approach to measurement that offers “a comprehensive, task-specific assessment” of the many qualities that matter, including “truthfulness” and “factuality.”
The authors believe their automated method brings that missing uniformity: “By automatically generating multiple-choice exams tailored to the corpus of documents associated with each task, our approach enables standardized, scalable and interpretable scoring of different RAG systems.”
To accomplish this task, the authors generate question-and-answer pairs based on material from four domains: AWS DevOps troubleshooting documents; abstracts of scientific papers from the arXiv preprint server; questions on StackExchange; and filings with the U.S. Securities & Exchange Commission, the primary regulator of publicly traded companies.
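To make the exam-generation step concrete, here is a minimal sketch of the general idea, not the authors' actual pipeline: a generator LLM is prompted with a passage from the corpus and asked to produce a multiple-choice question with one correct answer. The `call_llm` helper and the prompt wording are placeholders, not something specified in the paper.

```python
import json
from dataclasses import dataclass

@dataclass
class ExamQuestion:
    question: str
    choices: list[str]      # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    correct: str            # letter of the correct choice
    source_doc_id: str      # document the question was generated from

EXAM_PROMPT = """You are writing an exam about the following document.
Document:
{passage}

Write ONE multiple-choice question with four options (A-D) and exactly one
correct answer. Reply as JSON with keys: question, choices, correct."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g. a hosted Mistral or Llama endpoint)."""
    raise NotImplementedError("plug in your own model client here")

def generate_exam(corpus: dict[str, str], questions_per_doc: int = 1) -> list[ExamQuestion]:
    """Turn a {doc_id: text} corpus into a list of multiple-choice exam questions."""
    exam = []
    for doc_id, passage in corpus.items():
        for _ in range(questions_per_doc):
            raw = call_llm(EXAM_PROMPT.format(passage=passage))
            parsed = json.loads(raw)
            exam.append(ExamQuestion(
                question=parsed["question"],
                choices=parsed["choices"],
                correct=parsed["correct"],
                source_doc_id=doc_id,
            ))
    return exam
```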
They then designed multiple-choice exams to assess how close each LLM came to the correct answer. They put two families of open-source LLMs through these exams: Mistral, from the French company of the same name, and Llama, from Meta.
They test the models in three scenarios. The first is a “closed-book” scenario, in which the LLM has no access to the RAG data and must rely on its pre-trained neural “parameters” – or “weights” – to find the answer. The second is the “Oracle” form of RAG, in which the LLM is given the exact document used to generate the question, known as the ground truth.

The third is “classical retrieval,” in which the model must search the whole dataset for context relevant to the question, using one of a variety of retrieval algorithms. Several popular approaches are tested, including MultiQA, introduced in 2019 by researchers at Tel Aviv University and the Allen Institute for Artificial Intelligence, and an older but still widely used information-retrieval method called BM25.
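To illustrate how the three scenarios differ, the sketch below contrasts them, using the open-source rank_bm25 package for the classical-retrieval case. The prompt text and helper names are illustrative assumptions; the paper's own retrievers (such as the MultiQA-style retriever) are not reproduced here.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def build_prompt(question: str, context: str | None) -> str:
    """Closed-book when context is None; otherwise retrieval-augmented."""
    if context is None:
        return f"Answer the multiple-choice question.\n\n{question}"
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\n{question}"

def closed_book(question: str) -> str:
    # Scenario 1: the model relies only on its pretrained weights.
    return build_prompt(question, context=None)

def oracle(question: str, ground_truth_doc: str) -> str:
    # Scenario 2: the model is handed the exact document the question came from.
    return build_prompt(question, context=ground_truth_doc)

def classical_retrieval(question: str, corpus: list[str], top_k: int = 3) -> str:
    # Scenario 3: a retriever (here BM25) searches the corpus for relevant context.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    hits = bm25.get_top_n(question.lower().split(), corpus, n=top_k)
    return build_prompt(question, context="\n---\n".join(hits))
```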
They then run the exams and tally the results, which fill numerous charts and tables on the relative strengths and weaknesses of the LLMs and the different RAG approaches. The authors even perform a meta-analysis of their exam questions, using the well-known Bloom's Taxonomy from education research to assess how useful the questions are.
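The scoring itself boils down to exam accuracy for each combination of model and retrieval setup. Here is a minimal sketch of such a tally, with illustrative field names rather than the authors' actual data format:

```python
from collections import defaultdict

def tally_results(results: list[dict]) -> dict[tuple[str, str], float]:
    """Compute exam accuracy per (model, retriever) pair.

    Each record is assumed to look like:
    {"model": "mistral-7b", "retriever": "bm25", "predicted": "B", "correct": "B"}
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r["model"], r["retriever"])
        totals[key] += 1
        hits[key] += int(r["predicted"] == r["correct"])
    return {key: hits[key] / totals[key] for key in totals}

# Example: two exam questions answered by one configuration.
print(tally_results([
    {"model": "mistral-7b", "retriever": "bm25", "predicted": "B", "correct": "B"},
    {"model": "mistral-7b", "retriever": "bm25", "predicted": "A", "correct": "C"},
]))  # {('mistral-7b', 'bm25'): 0.5}
```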
What matters even more than the individual data points from the exams are the broad conclusions that may hold true for RAG, regardless of implementation details.
A general conclusion is that a better retrieval algorithm can improve an LLM's performance more than, for example, making the model larger.

“The correct choice of retrieval method can often lead to performance improvements beyond those resulting from simply choosing larger LLMs,” they write.
This is an important point, given concerns about GenAI’s increasing resource intensity. If more can be done with less, it’s an interesting avenue to explore. It also suggests that the conventional wisdom in AI that scaling is always best isn’t entirely true when it comes to solving real-world problems.
Just as importantly, the authors find that if the retrieval component does not work properly, it can degrade the LLM's performance compared with the simple closed-book version with no RAG at all.

“A misaligned retrieval component can lead to worse accuracy than no retrieval at all,” Guinet and his team explain.