
Study Reveals AI Chatbots Exhibit Overconfidence Despite Errors


AI chatbot
Credit: Unsplash/CC0 Public Domain

Artificial intelligence chatbots are everywhere these days, from smartphone apps and customer service portals to online search engines. But what happens when these handy tools overestimate their own abilities?

Researchers asked both human participants and four large language models (LLMs) how confident they felt in their ability to answer trivia questions, predict the outcomes of NFL games or Academy Award ceremonies, or play a Pictionary-like image identification game. Both the people and the LLMs tended to be overconfident about how they would hypothetically perform. Interestingly, they also answered questions or identified images with relatively similar success rates.

However, when the participants and LLMs were asked retroactively how well they thought they did, only the humans appeared able to adjust expectations, according to a study published in the journal Memory & Cognition.

“Say the people told us they were going to get 18 questions right, and they ended up getting 15 questions right. Typically, their estimate afterwards would be something like 16,” said Trent Cash, who recently completed a joint Ph.D. at Carnegie Mellon University in the departments of Social and Decision Sciences and Psychology. “So, they’d still be a little bit overconfident, but not as overconfident.”

“The LLMs did not do that,” said Cash, who was lead author of the study. “They tended, if anything, to get more overconfident, even when they didn’t do so well on the task.”
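The gap Cash describes can be expressed as simple calibration arithmetic. The following minimal Python sketch is illustrative only: the human numbers echo his example, while the LLM numbers are assumed here to match the reported pattern, not taken from the study.

    # Illustrative calibration arithmetic; not code from the study.
    # Overconfidence = estimated score minus actual score. A well-calibrated
    # retrospective estimate should move toward the actual score.

    def overconfidence(estimate: float, actual: float) -> float:
        """Positive values mean the estimate exceeded actual performance."""
        return estimate - actual

    # Human numbers from the article; the LLM line is an assumed pattern
    # matching the finding that LLMs grew *more* confident afterwards.
    human = {"before": 18, "after": 16, "actual": 15}
    llm = {"before": 18, "after": 19, "actual": 15}

    for name, s in (("Human", human), ("LLM", llm)):
        print(name,
              "prospective:", overconfidence(s["before"], s["actual"]),
              "retrospective:", overconfidence(s["after"], s["actual"]))

On these numbers, the human’s overconfidence shrinks from +3 before the task to +1 afterwards, while the assumed LLM pattern grows from +3 to +4.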

The world of AI is changing rapidly, which makes drawing general conclusions about its applications challenging, Cash acknowledged.

However, one strength of the study was that the data was collected over the course of two years, using continuously updated versions of the LLMs known as ChatGPT, Bard/Gemini, Sonnet and Haiku. This meant that AI overconfidence was detectable across different models over time.

“When an AI says something that seems a bit fishy, users may not be as skeptical as they should be because the AI asserts the answer with confidence, even when that confidence is unwarranted,” said Danny Oppenheimer, a professor in CMU’s Department of Social and Decision Sciences and co-author of the study.

“Humans have evolved over time and practiced since birth to interpret the confidence cues given off by other humans. If my brow furrows or I’m slow to answer, you might realize I’m not necessarily sure about what I’m saying, but with AI, we don’t have as many cues about whether it knows what it’s talking about,” said Oppenheimer.

Asking AI the right questions

While the accuracy of LLMs at answering trivia questions and predicting football game outcomes is relatively low stakes, the research hints at the pitfalls associated with integrating these technologies into daily life.

For instance, a recent study conducted by the BBC found that when LLMs were asked questions about the news, more than half of the responses had “significant issues,” including factual errors, misattribution of sources and missing or misleading context. Similarly, another study from 2023 found LLMs “hallucinated,” or produced incorrect information, in 69 to 88% of legal queries.

Clearly, the question of whether AI knows what it’s talking about has never been more important. And the truth is that LLMs are not designed to answer everything users throw at them on a daily basis.

“If I’d asked ‘What is the population of London,’ the AI would have searched the web, given a perfect answer and given a perfect confidence calibration,” said Oppenheimer.

However, by asking questions about future events—such as the winners of the upcoming Academy Awards—or more subjective topics, such as the intended identity of a hand-drawn image, the researchers were able to expose the chatbots’ apparent weakness in metacognition—that is, the ability to be aware of one’s own thought processes.

“We still don’t know exactly how AI estimates its confidence,” said Oppenheimer, “but it appears not to engage in introspection, at least not skillfully.”

The study also revealed that each LLM has strengths and weaknesses. Overall, the LLM known as Sonnet tended to be less overconfident than its peers. Likewise, ChatGPT-4 performed similarly to human participants in the Pictionary-like trial, accurately identifying an average of 12.5 hand-drawn images out of 20, while Gemini could identify just 0.93 sketches, on average.

In addition, Gemini predicted it would get an average of 10.03 sketches correct, and even after answering fewer than one out of 20 questions correctly, the LLM retrospectively estimated that it had answered 14.40 correctly, demonstrating its lack of self-awareness.

“Gemini was just straight up really bad at playing Pictionary,” said Cash. “But worse yet, it didn’t know that it was bad at Pictionary. It’s kind of like that friend who swears they’re great at pool but never makes a shot.”

Building trust with artificial intelligence

For everyday chatbot users, Cash said the biggest takeaway is to remember that LLMs are not inherently correct and that it might be a good idea to ask them how confident they are when answering important questions.

Of course, the study suggests LLMs might not always be able to accurately judge confidence, but in the event that the chatbot does acknowledge low confidence, it’s a good sign that its answer cannot be trusted.
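In practice, that advice amounts to a simple follow-up prompt. Here is a minimal Python sketch of the pattern; ask_chatbot is a hypothetical stand-in for whatever chat interface you use, with canned replies so the example runs on its own.

    # ask_chatbot is a hypothetical placeholder, not a real library call.
    # It returns canned text here; swap in a real chatbot client to use it.
    def ask_chatbot(prompt: str) -> str:
        if "confident" in prompt:
            return "40"  # canned self-reported confidence, 0-100
        return "Everything Everywhere All at Once"  # canned answer

    answer = ask_chatbot("Which film won Best Picture at the 2023 Oscars?")
    confidence = ask_chatbot(
        "From 0 to 100, how confident are you in that answer? Number only."
    )

    # Per the study: low self-reported confidence is a useful warning sign,
    # but high confidence is not proof of correctness.
    if confidence.strip().isdigit() and int(confidence.strip()) < 50:
        print("Low confidence reported; verify this answer:", answer)

The asymmetry in the final check mirrors the study’s finding: a chatbot admitting low confidence is informative, while a confident assertion still warrants skepticism.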

The researchers note that it’s also possible that the chatbots could develop a better understanding of their own abilities over vastly larger data sets.

“Maybe if it had thousands or millions of trials, it would do better,” said Oppenheimer.

Ultimately, exposing weaknesses such as overconfidence will only help those in the industry who are developing and improving LLMs. And as AI becomes more advanced, it may develop the metacognition required to learn from its mistakes.

“If LLMs can recursively determine that they were wrong, then that fixes a lot of the problems,” said Cash.

“I do think it’s interesting that LLMs often fail to learn from their own behavior,” said Cash. “And maybe there’s a humanist story to be told there. Maybe there’s just something special about the way that humans learn and communicate.”

More information:
Quantifying Uncert-AI-nty: Testing the Accuracy of LLMs’ Confidence Judgments, Memory & Cognition (2025). DOI: 10.3758/s13421-025-01755-4

Citation:
AI chatbots remain overconfident—even when they’re wrong, study finds (2025, July 22)
retrieved 22 July 2025
from https://techxplore.com/news/2025-07-ai-chatbots-overconfident-theyre-wrong.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.




