This year should be the year generative artificial intelligence (GenAI) takes off in the enterprise, according to many observers. One way to get there is retrieval-augmented generation (RAG), an approach in which a large AI language model is linked to a database of domain-specific content, such as a company's data files.
However, RAG is an emerging technology with pitfalls of its own.
That’s why researchers at Amazon AWS propose, in a new paper, a series of benchmarks that specifically test how well RAG can answer questions about domain-specific content.
“Our method is an automated, cost-effective, interpretable and robust strategy for selecting the optimal components for a RAG system,” write lead author Gauthier Guinet and his team in the paper “Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation,” published on the arXiv preprint server.
The paper is being presented at the 41st International Conference on Machine Learning, an AI conference taking place July 21-27 in Vienna.
The fundamental problem, Guinet and his team explain, is that while there are many benchmarks for comparing the ability of various large language models (LLMs) across many tasks, in the field of RAG in particular there is no “canonical” approach to measurement that offers “a comprehensive, task-specific assessment” of the many qualities that matter, including “truthfulness” and “factuality.”
The authors believe their automated method brings that missing uniformity: “By automatically generating multiple-choice exams tailored to the corpus of documents associated with each task, our approach enables standardized, scalable and interpretable scoring of different RAG systems.”
To accomplish this task, the authors generate question-and-answer pairs based on material from four domains: AWS DevOps troubleshooting documents; abstracts of scientific papers from the arXiv preprint server; questions on StackExchange; and filings with the U.S. Securities & Exchange Commission, the primary regulator of publicly traded companies.
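To make the exam-generation step concrete, here is a minimal sketch of the general idea, not the authors' actual pipeline: a generator LLM is prompted with a passage from the corpus and asked to produce a multiple-choice question with one correct answer. The `call_llm` helper and the prompt wording are placeholders, not something specified in the paper.

```python
import json
from dataclasses import dataclass

@dataclass
class ExamQuestion:
    question: str
    choices: list[str]      # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    correct: str            # letter of the correct choice
    source_doc_id: str      # document the question was generated from

EXAM_PROMPT = """You are writing an exam about the following document.
Document:
{passage}

Write ONE multiple-choice question with four options (A-D) and exactly one
correct answer. Reply as JSON with keys: question, choices, correct."""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g. a hosted Mistral or Llama endpoint)."""
    raise NotImplementedError("plug in your own model client here")

def generate_exam(corpus: dict[str, str], questions_per_doc: int = 1) -> list[ExamQuestion]:
    """Turn a {doc_id: text} corpus into a list of multiple-choice exam questions."""
    exam = []
    for doc_id, passage in corpus.items():
        for _ in range(questions_per_doc):
            raw = call_llm(EXAM_PROMPT.format(passage=passage))
            parsed = json.loads(raw)
            exam.append(ExamQuestion(
                question=parsed["question"],
                choices=parsed["choices"],
                correct=parsed["correct"],
                source_doc_id=doc_id,
            ))
    return exam
```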
They then designed multiple-choice exams to assess how close each LLM came to the correct answer. They put two families of open-source LLMs through these exams: Mistral, from the French company of the same name, and Llama, from Meta.
They test the models in three scenarios. The first is a “closed-book” scenario, in which the LLM has no access to the RAG data and must rely on its pre-trained neural “parameters” – or “weights” – to find the answer. The second is the “Oracle” form of RAG, in which the LLM is given the exact document used to generate the question, known as the ground truth.

The third is “classical retrieval,” in which the model must search the whole dataset for context relevant to the question, using one of a variety of retrieval algorithms. Several popular approaches are tested, including MultiQA, introduced in 2019 by researchers at Tel Aviv University and the Allen Institute for Artificial Intelligence, and an older but still widely used information-retrieval method called BM25.
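To illustrate how the three scenarios differ, the sketch below contrasts them, using the open-source rank_bm25 package for the classical-retrieval case. The prompt text and helper names are illustrative assumptions; the paper's own retrievers (such as the MultiQA-style retriever) are not reproduced here.

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def build_prompt(question: str, context: str | None) -> str:
    """Closed-book when context is None; otherwise retrieval-augmented."""
    if context is None:
        return f"Answer the multiple-choice question.\n\n{question}"
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\n{question}"

def closed_book(question: str) -> str:
    # Scenario 1: the model relies only on its pretrained weights.
    return build_prompt(question, context=None)

def oracle(question: str, ground_truth_doc: str) -> str:
    # Scenario 2: the model is handed the exact document the question came from.
    return build_prompt(question, context=ground_truth_doc)

def classical_retrieval(question: str, corpus: list[str], top_k: int = 3) -> str:
    # Scenario 3: a retriever (here BM25) searches the corpus for relevant context.
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    hits = bm25.get_top_n(question.lower().split(), corpus, n=top_k)
    return build_prompt(question, context="\n---\n".join(hits))
```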
They then run the exams and tally the results, which fill numerous charts and tables on the relative strengths and weaknesses of the LLMs and the different RAG approaches. The authors even perform a meta-analysis of their exam questions, using the well-known Bloom's Taxonomy from education research to assess how useful the questions are.
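The scoring itself boils down to exam accuracy for each combination of model and retrieval setup. Here is a minimal sketch of such a tally, with illustrative field names rather than the authors' actual data format:

```python
from collections import defaultdict

def tally_results(results: list[dict]) -> dict[tuple[str, str], float]:
    """Compute exam accuracy per (model, retriever) pair.

    Each record is assumed to look like:
    {"model": "mistral-7b", "retriever": "bm25", "predicted": "B", "correct": "B"}
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        key = (r["model"], r["retriever"])
        totals[key] += 1
        hits[key] += int(r["predicted"] == r["correct"])
    return {key: hits[key] / totals[key] for key in totals}

# Example: two exam questions answered by one configuration.
print(tally_results([
    {"model": "mistral-7b", "retriever": "bm25", "predicted": "B", "correct": "B"},
    {"model": "mistral-7b", "retriever": "bm25", "predicted": "A", "correct": "C"},
]))  # {('mistral-7b', 'bm25'): 0.5}
```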
What matters even more than the individual data points from the exams are the broad conclusions that may hold true for RAG, regardless of implementation details.
A general conclusion is that a better retrieval algorithm can improve an LLM's performance more than, for example, making the model larger.

“The correct choice of retrieval method can often lead to performance improvements beyond those resulting from simply choosing larger LLMs,” they write.
This is an important point, given concerns about GenAI’s increasing resource intensity. If more can be done with less, it’s an interesting avenue to explore. It also suggests that the conventional wisdom in AI that scaling is always best isn’t entirely true when it comes to solving real-world problems.
Just as importantly, the authors find that if the retrieval component does not work properly, it can degrade the LLM's performance compared with the simple closed-book version with no RAG at all.

“A misaligned retrieval component can lead to worse accuracy than no retrieval at all,” Guinet and his team explain.