Jakob Nielsen

UX Roundup: Poor AI Performance Conversing with Patients | AI Character Mashups | 2025 AI Trends | Accelerating Science | AI Progression

Summary: AI does worse than human doctors when eliciting clinical information in conversations with patients | AI-created mashups of famous characters and actors | AI trends for 2025 | Accelerating scientific discovery with AI | AI progression

UX Roundup for January 13, 2025. (Midjourney)


AI New UI Paradigm

I have a new video in which I explain how AI is the first new user-interface interaction paradigm in 60 years, moving us from command-based interactions to intent-based outcome specification. (YouTube, 3 min.)


See also my article about intent-based outcome specification (though the video is a faster and more popular way of communicating this concept).


Poor AI Performance Conversing with Patients

Many research studies have now found that AI performs better than human doctors in diagnosing medical problems based on clinical information. A recent study also found that AI was good at ordering the correct follow-up tests to differentiate between possible diagnoses in particularly difficult cases. (See also the good journalistic coverage of this paper in New Scientist.)


However, this raises the question of how the clinical information about the patient’s problem is derived in the first place. Much of the early information about a case comes from a conversation between the doctor and the patient. A new paper by Shreya Johri and many coauthors from Harvard Medical School and other hospitals and universities addresses this issue.


The authors had 3 leading AI models (GPT-4, Mistral, and LLaMA-2-7b) plus the obsolete GPT-3.5 model engage in simulated conversations with simulated patients in the form of another AI. I am rather disappointed that the study didn’t use real humans. As I’ve said before, user research requires representative human customers (or, here, patients). We can’t just simulate the users with AI.


That said, as a first study, it’s still interesting to see how AI performs when interviewing patients, even if they are only simulated patients.


The bottom line is that GPT-4 was only about half as good when it had to interview the patients itself as when it was presented with written summaries of the patient information. (26% correct diagnoses when interviewing the simulated patients vs. 49% correct diagnoses based on written case summaries.)


In many instances, the AI models failed to ask essential follow-up questions, leading to missed critical information that could guide effective treatment. (This finding parallels recent research on the limited ability of current AI to ask good follow-up questions during usability study sessions.) The authors also point out that future AI models need the ability to recognize and interpret non-verbal cues, including facial expressions and tonal variations, to better understand patients during consultations.


It is great to see research progress from limiting medical AI to interpreting existing clinical data toward evaluating the requirements for AI to provide all stages of healthcare, starting with that initial conversation with the patient. I hope that future research will test AI interacting with actual human patients rather than simulated ones.


Our future AI doctors need the ability to have a free-form conversation to elicit information from patients who walk into the clinic complaining about a problem. (Leonardo)


What can we conclude from this study if we assume that the findings will generalize to AI interacting with real human patients instead of simulated ones?


I don’t think we can conclude anything, since widespread clinical use of AI to diagnose and treat patients will probably not happen for another 2-3 years, even in poor countries that will benefit the most from AI doctors. By then, we’ll be using AIs that have advanced another one or two generations and have vastly different capabilities from GPT-4.


However, the study provides strong guidance for the development of these next-gen AI models: they need to develop the skills to interact better with humans and not just rely on book learning, the way current AI mostly does.


If we assume either no advances in AI or immediate deployment of clinical AI, then the conclusion is that AIs should serve as senior consulting experts who advise on diagnosis (since they’re better than humans at diagnosis once presented with clinical case data). Also, AI specialists can be replicated on demand in fields that need more specialists than the available human specialists can supply (for example, geriatrics and mental health). For the next five years, AI should not be the primary physician who interacts directly with patients.


I made a short song about this research (YouTube, 2 min.).


Content Mashups

The start of 2025 saw many famous characters pass into the public domain, as copyright expired on works like Tintin, Popeye, and the song “Singin’ in the Rain” (but not the movie of the same name). (As an aside, while copyright is important to incentivize creators, it shouldn’t last 100 years, but for now, that’s the rule in the United States.)


Many AI influencers have commented that the expanding list of public-domain characters and storylines could form the basis for many interesting mashups as creators combine previously separate characters and settings and make new videos with AI.


Mashups have been popular in corporate media in the past. For example, I still clearly remember a comic book from 50 years ago that featured a race around the globe between The Flash and Superman to determine once and for all which superhero was the fastest. These two characters are still under copyright, so we can’t make that contest into a video yet.


A race between two superheroes who are clearly not The Flash and Superman. (Leonardo)


The work of famous movie stars will remain under copyright for some time, since celebrity culture didn’t take hold until the 1950s, with a few exceptions like Charlie Chaplin. For the next 25 years, we should expect legacy movie studios to mine their archives and produce works starring AI-reproduced versions of famous movie stars from the past.


AI will soon produce films starring reproductions of actors who never costarred while alive. (Grok)


Is this a good or bad thing? Doesn’t matter, because it will happen. Audiences do flock to known names and faces.


I see two possible futures:


  • We get a lock-in of characters and “actors” who are endlessly recycled in new stories, which build their fame beyond their original level. By 2100, Marilyn Monroe may be so famous that no new blonde (human or avatar) has a chance to be cast for any commercially viable project.

  • Conversely, audiences may tire of recycled fame, and niche characters and avatars will rule for ever-more specialized productions. We can design precisely the most compelling “person” (who might be a raccoon) for any story. Even better, the characters can be individualized through Generative UI, so the video I see and the video you see will star different avatars, even if they tell the same story. We’ll each see whatever avatar appeals the most to us.


AI Trends for 2025

Last week, I shared my own 6 trends for UX in 2025 (avatar explainer and music video).

IBM has published predictions for AI trends in 2025 from Martin Keen, who is an IBM Fellow. (In many high-tech companies, “fellow” is a guru rank one level above the Distinguished Engineer rank I scored 31 years ago at Sun Microsystems, which was awarded to the top 0.1% of Sun’s talent. Thus, Keen is likely in the top 0.01% of IBM’s talent.)


Keen predicts the following 7 trends:


  1. AI agents: Intelligent systems that can reason, plan, and take action. They can break down complex problems, create multi-step plans, and interact with tools and databases to achieve goals. However, current models struggle with consistent logical reasoning and complex scenarios.

  2. Inference Time Compute: Newer AI models are extending inference processing to “think” before providing an answer. The amount of “thinking” time varies with the complexity of the request. This allows for improved reasoning without retraining the model. (A minimal sketch of this idea follows the list.)

  3. Very Large Models: The next generation of large language models (LLMs) is expected to have many times more parameters than current models, potentially upwards of 50 trillion.

  4. Very Small Models: Models with only a few billion parameters that can run on laptops or phones are becoming more prevalent. These models are often tuned for specific tasks.

  5. Advanced Enterprise Use Cases: AI will move beyond basic tasks like improving customer experience and automating IT operations to more complex applications such as advanced customer service bots, proactive IT network optimization, and adaptive cybersecurity tools.

  6. Near Infinite Memory: Context windows for LLMs are increasing, allowing chatbots to potentially remember everything they know about a user at all times.

  7. Human-in-the-Loop Improvements: While AI can sometimes outperform humans, combining human expertise with AI should lead to even better results. Improved systems will allow professionals to better integrate AI tools into their workflows.
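
Trend 2 (inference-time compute) is easiest to picture as a per-request budgeting decision rather than a property of the model itself. Here is a minimal sketch of the idea; the complexity heuristic and the token budgets are invented for illustration and are not how IBM or any specific vendor implements it.

```python
# Minimal illustration of inference-time compute: spend more "thinking"
# tokens on harder requests. The complexity heuristic and the budgets are
# invented for this sketch; real systems use learned policies.

def estimate_complexity(prompt: str) -> int:
    """Crude stand-in for a learned difficulty estimator: returns a 1-3 score."""
    hard_markers = ("prove", "diagnose", "multi-step", "optimize", "debug")
    score = 1 + sum(marker in prompt.lower() for marker in hard_markers)
    return min(score, 3)

def thinking_budget(prompt: str) -> int:
    """Map estimated complexity to a hidden-reasoning token budget."""
    budgets = {1: 256, 2: 2_048, 3: 16_384}
    return budgets[estimate_complexity(prompt)]

if __name__ == "__main__":
    for prompt in ("What is the capital of France?",
                   "Diagnose why this multi-step pipeline fails to optimize."):
        print(f"{thinking_budget(prompt):>6} reasoning tokens for: {prompt}")
```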


The first two predictions are safe, since they’re already happening. Numbers 3 and 4 are interesting because they are essentially opposites of each other. However, I agree: 2025 will almost certainly see both the release of next-generation AI based on huge models and the boiling down of AI models to smaller sizes that are cheaper to run.


2025 will simultaneously be defined by huge AI models that have many more parameters than anything seen before and by tiny AI models that are cheap and run locally. (Midjourney)


Accelerating Scientific Discovery with AI

Demis Hassabis from Google DeepMind was awarded the 2024 Nobel Prize in Chemistry, and his Nobel lecture has now been published. This 29-minute video is well worth watching for a popularized overview of his groundbreaking work on using AI for protein folding. It’s a good reminder that AI reaches far beyond the language and image models most of us use daily. Specialized AI tools can sometimes achieve even more in targeted domains.


A fun anecdote: Hassabis’ interest in AI stems from playing chess competitively as a child when he was the captain of the England junior team. He and his team trained on early chess computers, and when he was 11, he implemented a simple AI program that succeeded in beating his 5-year-old brother.


After describing his AlphaFold AI system for folding proteins, Hassabis generalized the lessons from this project, arguing that AI can be applied to a wide range of scientific challenges that involve navigating vast search spaces, optimizing for specific objectives, and leveraging large datasets.


Hassabis predicted that AI will dramatically accelerate the pace of scientific discovery by automating tasks, analyzing vast amounts of data, and generating novel hypotheses, leading to a future where AI tools empower scientists to make breakthroughs more rapidly and efficiently than ever before.


Any process in nature follows the laws of nature, and enough pretraining can build an AI that embeds these laws. This, in turn, makes the AI capable of working out problems in biology or other natural sciences. Returning to the protein-folding problem for which Hassabis was awarded the Nobel Prize: there are about 10 to the power of 300 different possible ways a given protein can fold, but only one of them is the correct shape that the protein actually takes in the real world. Nature “knows” the right way to fold a protein and does so in milliseconds, without having to try out any suboptimal shapes. Once his AI had derived a sufficient understanding of these laws of nature, it too could predict protein folds fast. AlphaFold derived the structures of 200 million proteins in a year, a huge improvement over the 170,000 protein structures derived by human scientists through decades of work before AI tackled the problem.
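
To get a feel for those numbers, here is a back-of-the-envelope calculation using the figures quoted above; the brute-force evaluation rate is an invented assumption, included purely to show why exhaustive search is hopeless.

```python
# Back-of-the-envelope scale check for the protein-folding numbers quoted above.
# The brute-force evaluation rate is an invented assumption for illustration.
conformations = 10 ** 300           # possible foldings of one protein (figure quoted above)
evals_per_second = 10 ** 18         # assumed, absurdly generous brute-force search speed
seconds_per_year = 60 * 60 * 24 * 365

brute_force_years = conformations / (evals_per_second * seconds_per_year)
print(f"Brute-force search of one protein: ~{brute_force_years:.1e} years")

alphafold_structures = 200_000_000  # predicted by AlphaFold in about a year
human_structures = 170_000          # solved by scientists over several decades
print(f"AlphaFold vs. prior human total: {alphafold_structures / human_structures:,.0f}x more structures")
```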


Hassabis suggested that AI could become the “description language” for biology, similar to how mathematics is the descriptive language for physics.


He argued that while mathematics has been incredibly successful in describing physical phenomena through equations, the sheer complexity of biological systems requires a different approach. With its ability to learn patterns and model complex relationships, AI might be better suited to capture and represent the intricacies of biology.


If this turns out to be true, AI will revolutionize our understanding of biological systems, just as mathematics was instrumental in advancing our knowledge of the physical world.


Mathematical formulas have long been the way we understand physics. Similarly, AI will become the way we understand biology and medicine — fields too complex to be captured by math. (Leonardo)


Jensen Huang Keynote: Growth of AI

The head of NVIDIA, Jensen Huang, gave a keynote at last week’s CES (YouTube, 92 min.). I feel that Huang is emerging as the lead visionary for the growth of AI, and the keynote is worth a listen, even though it’s long.


Also, even though Huang himself is charismatic enough, in a charmingly nerdy way, the presentation is substandard for a major-event keynote. He throws around a seemingly endless progression of numbers to describe the technologies his company is launching, but without using proper storytelling techniques to conceptualize these tech advances or relate them to each other. The audience is left stunned by data without an understanding of what the data means. The slides are also fairly useless.


My advice to Huang: Hire Duarte. (The world’s leading experts on storytelling and presentation design.) They even have a course called DataStory that would be perfect for whoever is helping put together these keynotes.


Gripes over, let’s get to two interesting takeaways from the keynote, other than the fact that NVIDIA launched a boatload of products with meaningless but impressive-sounding specs.

First, Huang pointed out that we now have three scaling laws for AI. I would add that where there are three, there might be four, so don’t be surprised if leading AI labs announce one more scaling law in 2025 or 2026.


We’ve long known about the scaling law for AI pre-training: the more data you feed to AI and the more compute you allocate to training on this data, the more capable the resulting model. Since the biggest models have already read the entire Internet, pre-training scaling now continues with synthetic data generated by the most powerful models.

The problem is that we need about 100x more training compute to progress to each next generation of AI capability. Going from 100 to 10,000 GPUs for a training run was fine, but going from 10,000 to the 1M GPUs we’ll need for Ph.D.-level intelligence in 2027 will be expensive.


The 3 AI scaling laws: Pre-training (absorbing more data), post-training (getting better at dealing with the data), and test-time reasoning (more accurate conclusions from the data). (Ideogram)


The second AI scaling law is for post-training. After developing a model, we train it to be stronger with techniques such as reinforcement learning: we have the AI process a prompt and propose one or more answers, which are then rated for quality. I expect this approach to be the main way we make AI better at designing user interfaces and analyzing usability tests. The reason this is a scaling law is that experience now shows that the more RL (or other post-training techniques) we apply, the stronger the resulting AI becomes.
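
To make the loop concrete, here is a deliberately simplified sketch of one technique in this family: rejection sampling on rated answers (generate several candidate answers per prompt, keep only the best-rated one, and fine-tune on the survivors). This is not necessarily the exact recipe any lab uses, and the toy “model” and reward function below are placeholders.

```python
# Toy sketch of one post-training recipe (rejection sampling): sample several
# answers per prompt, rate them, and keep the best as new fine-tuning data.
# The "model" and reward function are placeholders, not a real LLM API.
import random

def toy_model(prompt: str) -> str:
    """Stand-in for an LLM: returns one of several canned answers."""
    candidates = [f"{prompt} -> short answer",
                  f"{prompt} -> detailed answer with reasoning",
                  f"{prompt} -> off-topic rambling"]
    return random.choice(candidates)

def reward(answer: str) -> float:
    """Stand-in for a human or learned rater: prefers detailed, on-topic answers."""
    score = 0.0
    if "reasoning" in answer:
        score += 1.0
    if "off-topic" in answer:
        score -= 1.0
    return score

def collect_finetuning_data(prompts, samples_per_prompt=4):
    """Keep only each prompt's highest-rated answer as a new training pair."""
    dataset = []
    for prompt in prompts:
        answers = [toy_model(prompt) for _ in range(samples_per_prompt)]
        dataset.append((prompt, max(answers, key=reward)))
    return dataset

if __name__ == "__main__":
    for prompt, answer in collect_finetuning_data(["Design a checkout flow"]):
        print(prompt, "=>", answer)
```

Scaling this law simply means more prompts, more samples per prompt, and more rounds of fine-tuning on the filtered data.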


Finally, in late 2024, we got the third AI scaling law: test-time reasoning, as used in OpenAI’s o1 and o3 models. Here, we simply have the AI “think harder” by breaking down the problem into components and reasoning through the steps, and/or by trying multiple approaches to the problem, which it evaluates on its own before presenting the best result to the user. Again, this scales: the more the AI thinks, the better the final results.
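
Shifted to inference time, this idea looks roughly like best-of-n sampling with self-evaluation. The internals of o1 and o3 are not public, so the sketch below only illustrates the general pattern; both functions are placeholders.

```python
# Illustrative best-of-n test-time compute: generate several candidate
# solutions for one query, have the model grade its own candidates, and
# return only the top-scoring one. Both functions are placeholders.
import random

def generate_candidate(query: str) -> str:
    """Placeholder for one reasoning attempt by the model."""
    steps = random.randint(1, 5)
    return f"Answer to '{query}' reached in {steps} reasoning steps"

def self_evaluate(candidate: str) -> float:
    """Placeholder for the model scoring its own attempt (here: more steps = more thorough)."""
    return float(candidate.split()[-3])  # extracts the step count from the string

def answer_with_test_time_compute(query: str, n: int = 8) -> str:
    """More compute (a larger n) means more attempts and usually a better final pick."""
    candidates = [generate_candidate(query) for _ in range(n)]
    return max(candidates, key=self_evaluate)

if __name__ == "__main__":
    print(answer_with_test_time_compute("Plan a usability test for a banking app"))
```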


A crucial insight (which is visualized poorly in Huang’s slide on this point) is that all three scaling laws operate in parallel. We can add more compute at all three stages and scale all three ways of improving AI results. However, because the improvements are logarithmic in the amount of compute spent at each stage, the return on investment is highest when applying extra compute to the scaling laws that haven’t been scaled as far yet.
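
A simple way to see why: if quality at each stage grows roughly with the logarithm of the compute spent there, the same extra compute buys the biggest gain wherever the least compute has been spent so far. The figures below are invented purely to illustrate the shape of the curve; they are not actual lab budgets.

```python
# Illustrative only: if each scaling law yields quality ~ log10(compute),
# the same extra compute buys the biggest gain at the least-scaled stage.
# All compute figures are made up for illustration.
import math

stages = {                       # compute already spent (arbitrary units)
    "pre-training": 1_000_000,
    "post-training": 10_000,
    "test-time reasoning": 100,
}

extra = 1_000                    # the same additional compute offered to each stage

for stage, spent in stages.items():
    gain = math.log10(spent + extra) - math.log10(spent)
    print(f"{stage:>20}: quality gain ~ {gain:.3f}")
# The least-scaled stage (test-time reasoning) shows by far the largest gain.
```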


For example, xAI has revealed that it used 100,000 Hopper GPUs for the recently completed pre-training run of Grok 3. To move up one level, they would have to 100x this investment and build a datacenter with 10M GPUs. They haven’t revealed how much compute they’re using for post-training, but since this is a newer scaling law, the likelihood is that the investment is much smaller, meaning that it will be fairly cheap to 100x it. The same is true for scaling test-time compute, with the important caveat that this expense is not a one-time training investment but an inference-time cost that’s incurred for every user query. (On a related point, OpenAI has stated that they lose money on the $200/month subscription needed to use o1-Pro. It’s so good that subscribers use it a lot, but each and every prompt is expensive to answer.)


Huang’s second point of general interest is his model of the 4 stages of AI:


  • Perception AI (2012): speech recognition, image processing

  • Generative AI (2022): write text, make images and video

  • Agentic AI (2025): take steps on behalf of the user, software coding, use websites

  • Physical AI (2026): robots, self-driving cars, doing household tasks


In particular, Huang predicted that robots will experience the equivalent of a “ChatGPT moment” soon and turn from a curiosity to a practical reality with countless applications.


The four stages of AI. (Ideogram)

 
