
AI Hallucinations on the Decline

Writer: Jakob Nielsen
Summary: Early AI products proclaimed many falsehoods, but such hallucinations are getting fewer with modern AI products. In particular, larger models have reduced hallucination rates. Waiting for better AI models while developing complex AI products is a good strategy, but better UX design can mitigate the hallucination problem with current AI models.

 

Hallucinations were a curse during the first years of practical AI. Luckily, hallucinations are on the decline as AI models get bigger and better, with continued improvements according to the AI scaling laws.


A hallucination is simply when AI generates information that is false while presenting it as true. In other words, the system “makes up” content: a non-existent citation, an incorrect fact, or an implausible detail. One study found that 40% of ChatGPT 3.5’s cited references were hallucinated.


Early AI products must have been raised on a diet of magic mushrooms, since they frequently hallucinated. (Midjourney)


A good example of hallucinations comes from my work on this very article: I asked both OpenAI’s Deep Research and Google’s Gemini 1.5 product of the same name to survey existing research on the UX impact of hallucinations and best practices for alleviating them. I requested that the analysis be restricted to work done after the release of GPT 3.5, since earlier work is likely to be less relevant for modern AI systems. Gemini Deep Research gave me an overview of its proposed research plan, stating that it would start with “the release of GPT 3.5 in 2020.” This year was a hallucination: ChatGPT 3.5 was in fact launched in 2022. Luckily, Gemini asked for possible modifications to its plan before it started researching, so I easily corrected the mistake before setting it off to think.


Hallucinating AI lowers usability by wasting users’ time, degrading the quality of AI work products, and reducing users’ trust in AI. At least hallucinations are on the decline. (Midjourney)


(By the way, using both tools for this article confirmed my previous analysis that OpenAI’s Deep Research is currently more useful than Google’s. Of course, with the pace of AI advances, this conclusion could easily change.)


Bigger AI Models Hallucinate Less

As AI gets more powerful with each generation, hallucinations are on the decline. The same study that found 40% made-up literature references in output from ChatGPT 3.5 (from 2022) found only 29% false references from ChatGPT 4, released half a year later.


The Hugging Face Hallucinations Leaderboard has subjected 102 AI models to the same hallucination benchmark, making comparisons possible. The following chart shows the hallucination rate for the 72 models for which I could find the release date.


Each dot indicates the hallucination rate of one AI model according to the HHEM-2.1 hallucination detection model. (Data from the Hugging Face Hallucination Leaderboard.)


The regression line shows that hallucination rates decline by 3 percentage points per year. If we project the regression line into the future, it “predicts” that AI will hit zero hallucinations in February 2027, which coincidentally is when I expect next-generation AI to reach the much-hyped “AGI” (artificial general intelligence).


Obviously, a regression line is not a true predictor of the future, particularly for a dataset like this, with large variability in the underlying data. However, I do expect the next-generation models that we’ll likely get in 2027 to exhibit a very low hallucination rate.
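
For readers who want to reproduce this kind of trend estimate, here is a minimal sketch of the calculation: an ordinary least-squares fit of hallucination rate against release date, extrapolated to its zero crossing. The data points are made-up placeholders chosen to roughly match the trend described above, not the actual leaderboard values.

```python
# Minimal sketch of the release-date regression described above.
# The data points are illustrative placeholders, not the actual
# Hugging Face Hallucination Leaderboard numbers.
import numpy as np

# (years since an arbitrary reference date, hallucination rate in percent)
release_year = np.array([0.1, 0.4, 0.8, 1.2, 1.6, 2.0])
halluc_rate = np.array([9.5, 8.0, 7.5, 5.5, 4.5, 3.5])

# Ordinary least-squares fit: rate = slope * year + intercept
slope, intercept = np.polyfit(release_year, halluc_rate, 1)
print(f"Decline: {abs(slope):.1f} percentage points per year")

# Extrapolate the fitted line to its zero crossing (the "zero hallucinations" date).
# As noted above, this is a projection, not a real prediction.
zero_year = -intercept / slope
print(f"Fitted line hits 0% about {zero_year:.1f} years after the reference date")
```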


It’s getting better every year — also with respect to AI hallucinations, which have been dropping steadily. (Ideogram)


Hallucinations are dropping for two main reasons:


  • AI providers are aware that hallucinations are one of the main impediments to lucrative enterprise applications, so they have a strong incentive to design their newer models to hallucinate as little as possible.

  • AI models are getting bigger and bigger, which allows the AI to know more and be less likely to hallucinate.


The following chart replots the Hugging Face Hallucination Leaderboard for the 62 AI models for which I could find size estimates, as measured by each model’s parameter count.


The size of an AI model, measured by the number of parameters, impacts that model’s hallucination rate. Note that the x-axis is logarithmic, as is appropriate for most things relating to the AI Scaling Laws. (Data from the Hugging Face Hallucination Leaderboard.)


Here, the regression line shows that hallucinations drop by 3 percentage points for each 10x increase in model size.


The new reasoning models also seem to reduce hallucinations, with a low 0.8% hallucination rate from OpenAI’s o3-mini-high-reasoning. However, there are too few reasoning models in the data to make firm estimates of the degree to which moving further up this 3rd AI Scaling Law might help by using more inference compute to reduce hallucinations.


The jump to a new AI generation usually happens every two years and requires scaling AI by a factor of 100x, which would correspond to a 6-percentage-point drop in the hallucination rate.


Large AI models have fewer hallucinations than smaller models. (Leonardo)


Projecting out the regression of hallucinations by model size “predicts” that we’ll reach zero hallucinations once AI models have about 10 trillion parameters. This is expected to happen around 2027. Thus, our two estimates of when AI hallucinations will stop being a serious problem (derived from release dates and model sizes) are the same.
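
The same kind of calculation works for model size, except that the fit is against the base-10 logarithm of the parameter count, matching the logarithmic x-axis of the chart. Again, the data points below are illustrative placeholders rather than the real leaderboard numbers.

```python
# Minimal sketch of the model-size regression: hallucination rate vs.
# log10(parameter count). Data points are illustrative placeholders.
import numpy as np

params = np.array([7e9, 13e9, 70e9, 180e9, 400e9, 1e12])  # parameter counts
halluc_rate = np.array([9.0, 8.0, 6.0, 5.0, 4.0, 2.5])    # percent

log_params = np.log10(params)
slope, intercept = np.polyfit(log_params, halluc_rate, 1)

# The slope is the change in hallucination rate per 10x increase in size;
# a 100x jump (one AI generation) is twice that.
print(f"Decline: {abs(slope):.1f} percentage points per 10x increase in parameters")
print(f"Decline: {abs(2 * slope):.1f} percentage points per 100x (one AI generation)")

# Parameter count at which the fitted line reaches 0% (a projection, not a prediction)
zero_log = -intercept / slope
print(f"Fitted line hits 0% at about {10 ** zero_log:.2e} parameters")
```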


To Err Is Also Human

Are hallucinations unique to AI? Not if we restate the problem as an information source providing incorrect information. Humans do this all the time.


For example, a 2014 meta-analysis of three medical studies estimated that 5.08% of adult primary care patients in the United States are misdiagnosed. Worse, the discharge information that internal medicine residents prepare for patients leaving the hospital has high inaccuracy rates, which is especially troubling because people who require hospital treatment are usually sicker than people who simply see their primary care doctor. One study found the following rates of inaccurate information in these discharge documents:


  • Discharge medication list: 36%

  • Follow-up instruction for family physicians: 18%

  • Discharge diagnosis: 5%


Interestingly, the diagnosis error was the same (5%) in the two studies.


Much research shows that AI is better at medical diagnosis than human doctors, even though it’s currently worse at interviewing patients.


Many studies now show that AI does better than human doctors when diagnosing patients from clinical data. The question is not whether AI is perfect (it’s not), but how it performs relative to humans. (Leonardo)


Turning to the legal domain, the outcome of criminal and civil cases was often worse than predicted by the lawyers. Attorneys who stated a confidence level of 86% or higher of winning their minimally desired outcome in court only achieved this result 70% of the time. A review of 6,000 death-penalty cases found that the defense attorney had made errors in 68% of the cases (with 37% being judged as “egregiously incompetent”).


Possibly worst of all, a 2005 study of American newspapers found factual errors in 61% of 3,287 stories across 14 metro newspapers. In fact, studies from 1936 to 1999 found error rates ranging from 46% in 1936 to 55% in 1999 in American newspaper articles. Are present-day journalists more accurate? Given that the error rates in the 8 studies cited were all in the 41-61% range across many decades, you have to be an extreme optimist to believe that things are better now. Furthermore, the 2005 study found that the inaccurate articles (the 61%) contained an average of three errors each.


Humans also say things that seemingly come out of nowhere and are false, even when they don’t intend to mislead. Fortunately, other humans are used to this fact and have developed mitigation strategies that also somewhat work against AI hallucinations. (Midjourney)


Since erroneous information is so common, people have developed ways of dealing with errors in the most important cases. In medicine, it’s common to request a second opinion from an independent doctor before taking drastic action, such as undergoing surgery. In a study of patients who sought a second opinion from the Mayo Clinic (a respected medical institution in the United States), 88% of patients ended up with a changed or refined diagnosis.


The concept of double-checking information can be taught. In one case study, college students who took a 4-hour course on evaluating internet sources dramatically improved their ability to identify incorrect information on websites: scores on a 13-point test increased from 4 to 7. (Obviously, this means that the students still missed identifying much inaccurate information, even after the training.)


Is AI better or worse than humans when it comes to providing erroneous information? That’s hard to say, because the available studies have measured so many different topics with so many different definitions of “errors.” But two things are certain:


  • Humans also make errors and provide wrong information. This is not at all infrequent.

  • Because people are accustomed to receiving error-prone answers, they have evolved strategies to partially alleviate this problem.


When assessing the acceptability of AI hallucinations, we should compare AI with the humans who are realistically available to perform the same tasks, not the world’s most perfect human for that task. AI scales, but humans don’t, so you can run the best possible current AI product, but you cannot hire the world’s best human to do the job for you.


Mitigating AI Hallucinations Through UX Design

The common strategy of seeking a second opinion also works for AI. In a 2024 study of asking AI about diabetes guidelines, the single-best AI model achieved a recall of 88% of pertinent information from the American Diabetes Association’s Standards of Care, but combining the results from multiple AI models increased the performance to 95%.
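
As an illustration of the “second opinion” idea, the sketch below pools the guideline items recalled by several models and scores the union against a reference list. The query_model() function, the model names, and the guideline items are hypothetical stand-ins, not the API of any real product or the method of the cited study.

```python
# Hedged sketch of the "second opinion" idea: pool answers from several
# models and keep any guideline item that at least one model recalls.

def query_model(model_name: str, question: str) -> set[str]:
    """Pretend call that returns the set of guideline items a model cites."""
    canned = {
        "model-a": {"A1c target", "statin therapy", "annual eye exam"},
        "model-b": {"A1c target", "blood pressure goal"},
    }
    return canned.get(model_name, set())

def ensemble_recall(question: str, models: list[str], reference: set[str]) -> float:
    """Union the items recalled by each model and score recall against the reference."""
    recalled: set[str] = set()
    for name in models:
        recalled |= query_model(name, question)
    return len(recalled & reference) / len(reference)

reference_items = {"A1c target", "statin therapy", "annual eye exam", "blood pressure goal"}
question = "What do the diabetes standards of care recommend?"

for name in ["model-a", "model-b"]:
    print(name, ensemble_recall(question, [name], reference_items))
print("combined", ensemble_recall(question, ["model-a", "model-b"], reference_items))
```

Running the toy example shows each model alone recalling only part of the reference list, while the combined answer covers all of it, which is the pattern the diabetes study reported.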


One major difference between human error and AI hallucinations is that in the case of AI, we can lower the barrier to seeking that second opinion through better UX design. For example, Deep Research (which I used for background research for this article) provides a one-click link to the source for each point it makes. Nobody has time to click all the links on the Internet, but if a certain statement is important, users are much more likely to check it when doing so only requires a single click. In any case, the easier it is to check, the more often it will be done.
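
One way to support this kind of one-click verification is to structure the AI’s output as claims that each carry a source link, roughly in the spirit of Deep Research’s citations. The data structure below is an illustrative sketch, not any product’s actual format; the URLs are placeholders.

```python
# Sketch of structuring AI output so every claim carries a one-click source link.
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    source_url: str  # link the user can follow to verify the claim

def render_answer(claims: list[Claim]) -> str:
    """Render each claim with an inline, clickable citation."""
    return "\n".join(f"- {c.text} [source]({c.source_url})" for c in claims)

answer = [
    Claim("ChatGPT 3.5 launched in 2022.", "https://example.com/launch-announcement"),
    Claim("Larger models hallucinate less.", "https://example.com/leaderboard"),
]
print(render_answer(answer))
```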


Hallucination mitigation requires transparency, user empowerment, and fail-safes. These reduce the risk of users being fooled by AI-generated falsehoods, thereby improving both the outcome (users achieve their goals correctly) and the experience (users maintain trust in AI). However, these UX measures are not foolproof and do not completely solve the underlying issue — they manage it.


The effectiveness of hallucination-mitigating design depends on users actually utilizing the provided tools and information. A citation no one reads or a warning no one heeds has little value. Usability testing is essential when designing anti-hallucination features: ensuring that they are noticed, understood, and actually aid the user without being overwhelming. So far, the signs are encouraging that with thoughtful UX design, the worst effects of AI hallucinations can be blunted. Users can be kept in the loop, informed, and equipped to question the AI when needed.


Overcoming Hallucinations: Wait and Be Happy

The data in the above two charts suggest that the main strategy for overcoming AI hallucinations is simply to wait for better times. I don’t believe that we’ll truly reach zero hallucinations by AGI-time in 2027, because the last edge cases are always harder to solve than we think. But it’s likely that hallucinations will become an insignificant issue by then in most domains.


Don’t worry — be happy! Wait for time to fix most of our problems with AI hallucinations. (Leonardo)


Waiting it out is not as silly a strategy as you might think. In many cases, it could take two years to develop and deploy a complex enterprise AI application. So you can start work on the application now and be happy in the expectation that things will get better, even if current hallucination rates would invalidate your project.


The “wait and be happy” strategy has indeed worked well for many of the startups funded by the leading Silicon Valley startup incubator Y Combinator. Every time a new AI model was released, many of their startup projects changed from “not working” to “working.” Since AI is expected to improve through several more generations until at least 2030, it is not at all stupid to initiate AI projects that are infeasible with current AI with the expectation that future AI will reverse this situation.


Of course, you shouldn’t start a project that assumes science-fiction-like advances in AI within a few years. The challenge is to shoot for realistic AI improvements. How can you assess the degree of change we can realistically expect in two or five years? Look back the same number of years and compare the AI of yesteryear with current AI.


Turning to look where you have been is an effective strategy for predicting the amount of future change, especially when the road ahead is murky. (Leonardo)


Watch my music video about AI hallucinations (YouTube, 2 min.).
