
UX Roundup: Film Festival | AI Therapy | AI Radiology | AI Creative Writing | Grok 3

Jakob Nielsen
Summary: AI film festival | AI as good as human therapists | AI beats human radiologists | AI Creative Writing Test: GPT 4o vs. OpenAI Deep Research | Grok 3 launched

UX Roundup for February 24, 2025. (Leonardo)


AI Film Festival

The third annual AI Film Festival returns this spring to New York and Los Angeles to celebrate artists at the forefront of film and technology. The deadline to submit your AI creation for the awards is March 31.


You can watch the 2024 winners on the festival page linked above. I recommend watching, since there are already some awesome AI films out there, despite the primitive state of the technology last year.


The best AI films are already amazing. Can’t wait to see this year’s winners. (Leonardo)


Nicéphore (a French “AI Creative Studio”) posted a great AI film they had made for President Macron’s recent AI Summit in Paris. The images are pure AI, but the audio track was reportedly made with traditional moviemaking Foley methods. Despite only being 3 minutes, I think the film takes too long to get to the point: you have to watch to the end to get the punchline — but that’s worth doing. The film is a good example of storytelling, regardless of the technology used, which is exactly what we want. (I’m thinking I should just give you the spoiler, since it’s a historical fact: when the world’s first film was shown to an unsuspecting public in Paris in 1895, the audience fled from the screen because it showed a locomotive steaming right at them, and people thought they were about to get run over by an actual train, having never seen a moving image.)


For copyright reasons, this is not a still from the movie. Imagine being scared by having a locomotive come straight at you. (Midjourney)


AI Equals Human Therapists

One more study showing that AI can match human clinicians: This new research is by S. Gabe Hatch and a bunch of coauthors from various hospitals and universities in the United States and Switzerland.


The authors studied couples’ therapy and compared 15 experienced human therapists (mostly with Ph.D.s in clinical psychology) with ChatGPT 4. Rather than having the participants treat real clients, the study was based on 18 written case descriptions (“vignettes”). While this reduces the real-world applicability of the findings, the vignette method ensures that humans and AI were working from the same stimuli.


The human therapists and the AI wrote their responses to the clients described in the vignettes. Other experts then rated these responses, and the best were chosen from both the set of human responses and the set of AI-generated responses. This step is unrealistic for human therapists in everyday practice because it would be prohibitively expensive to have 7 or 8 experts suggest responses for each client and then have other experts rate these suggestions before presenting the winner to the client. In the case of AI, it would be cheap enough to have one set of AIs generate multiple responses and then have another AI select the best — if it were possible for an AI to judge the quality of therapy responses. (I doubt this is true now, but AI will likely gain this ability no later than 2030, when superintelligence is achieved.)
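If one wanted to automate that two-step process, the underlying logic is just “best-of-N” sampling with an AI judge. Here is a minimal sketch in Python; the generator and judge are passed in as placeholder functions, because the study used no such pipeline and I am not assuming any particular model API:

```python
from typing import Callable

def best_of_n(
    vignette: str,
    generate: Callable[[str], str],      # placeholder: one call to a generator model
    judge: Callable[[str, str], float],  # placeholder: a judge model scores (vignette, response)
    n: int = 8,
) -> str:
    """Generate n candidate responses and return the one the judge scores highest."""
    candidates = [generate(vignette) for _ in range(n)]
    return max(candidates, key=lambda response: judge(vignette, response))
```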


AI equals humans as a therapist for couples, at least in theory, though likely not in clinical practice yet. (Midjourney)


For each vignette, the best human response and the best AI response were further rated by a panel of 830 regular people (i.e., not therapists, but potential clients), who randomly received either an AI or a human response for review. Results:


  • When assessing whether a human or an AI wrote the response, the panelists were only slightly better than flipping a coin: 56% of human responses were correctly judged to be human, and 51% of AI responses were correctly judged to come from a computer. (In other words, 44% of human responses were estimated to come from AI, and 49% of AI responses were assessed as human.)

  • The AI-generated responses were rated as more empathic and “culturally competent” than the responses from the human therapists, though these differences were not statistically significant.

  • The AI stigma effect was confirmed: responses that panelists thought were written by a human were rated higher (29.46) than responses that panelists thought were written by AI (23.78). The highest ratings of all were for responses that were actually written by AI but which panelists thought were human (29.97).


Overall conclusion: In this study, AI and human therapists were equally good, but ratings dropped when the panelists thought that the response was generated by AI.

Based on AI stigma, the scene imagined in my illustration is not realistic for the time being. If a couple walked into a therapist’s office and were met by a robot, they would probably walk right out again.


Even disregarding AI stigma in the hope that it will vanish once people get robot helpers at home to do their laundry and babysit their kids, AI will need to improve its ability to interview patients in real life instead of relying on clinical data that has already been collected.


Thus, this study doesn’t show that the time has come for AI psychiatrists. But it does refute the often-claimed position that AI can’t show empathy. Pragmatically speaking, AI therapy has multiple benefits over human therapists:


  • Much cheaper.

  • Available in any location and language, including poor countries or remote villages.

  • More private and confidential.

  • Updates to best practices spread instantaneously, whereas human experts are notoriously reluctant to change their well-honed style.


Human experts (whether surgeons or specialists in other disciplines) resist changing how they’ve always done things, whereas AI can be instantly updated worldwide when better processes become available. (Leonardo)


AI Beats Human Radiologists

I may have to stop posting about studies finding that AI is better than human doctors at diagnosis because they’re coming fast and furious now that the “AI” condition has been upgraded to GPT-4. (Early studies used the much weaker GPT-3.5.) Academia is always a year behind the real world, which matters more in AI than in most fields, so we may have to wait until 2026 to read papers comparing OpenAI Deep Research with human doctors. I predict it’ll be found to beat them even harder than GPT-4 did in the 2025 papers that are mainly based on data collected in 2024.


One of the latest papers compared breast cancer detection by AI or human radiologists: “Screening performance and characteristics of breast cancer detected in the Mammography Screening with Artificial Intelligence trial (MASAI): a randomised, controlled, parallel-group, non-inferiority, single-blinded, screening accuracy study” by Veronica Hernström and 9 colleagues from Lund University in Sweden and various hospitals in Sweden and Norway.

The study involved 105,486 women and found that using AI to read mammograms increased cancer detection rates by 29% (without elevating false-positive rates) while reducing radiologists’ workload by 44%.


Traditionally, in Sweden, each mammogram is reviewed by two human radiologists (if their diagnoses disagree, there’s a discussion of how to proceed). In the new research, half of the patients had their scan first assessed by AI, and only if it scored the patient as being at high risk were two radiologists asked to review it. Low-risk patients (according to the AI) still had a single human radiologist review their scan, which is why the workload reduction was only 44% and not closer to 100%. In the future, when even better AI becomes available and the medical system becomes more accustomed to AI diagnosis, I would expect that one could avoid any human involvement in the many low-risk cases.
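The triage logic itself is simple enough to sketch. The following is only a schematic illustration of the reading protocol described above, not the trial’s software; the risk-score scale, the threshold, and the 20% high-risk share in the comment are illustrative assumptions, not numbers from the paper:

```python
HIGH_RISK_THRESHOLD = 0.8  # illustrative cut-off on an assumed 0-1 AI risk score

def human_readers_needed(ai_risk_score: float) -> int:
    """How many radiologists review a mammogram under the AI-triage protocol."""
    if ai_risk_score >= HIGH_RISK_THRESHOLD:
        return 2  # high-risk scans keep the traditional Swedish double reading
    return 1      # low-risk scans get a single human reader

# Workload illustration: if, say, 20% of scans were flagged as high risk, the
# average would be 0.2 * 2 + 0.8 * 1 = 1.2 readings per scan instead of 2,
# i.e., a 40% reduction, which is the same ballpark as the trial's 44%.
```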


The AI system in the study was version 1.7.0 of Transpara, a specialized radiology AI, as opposed to a general-purpose AI like ChatGPT or Claude. While one data point doesn’t prove anything, this case study does indicate the potential for more investment in special-purpose AI. When we can save 44% of expensive radiologist time, those savings can fund a lot of specialized development effort.


The headline metrics from the study were obviously the combination of more accurate patient diagnoses and cost reductions. (Usually, you get one or the other in the medical system.) But another interesting finding is that AI was 24% better at detecting early-stage, node-negative invasive cancers, suggesting a downward stage migration in detected cancers. Earlier diagnosis of breast cancer translates into less unpleasant treatment (e.g., avoiding chemotherapy) and higher survival rates.


Reviewing medical scans should be transitioned from human doctors to AI in cases where AI has proven to have better diagnostic accuracy. Human review will soon be considered malpractice. (Midjourney)


Listen to my song about the medical research reviewed in this newsletter (YouTube, 3 min.).


Haiku Battle: GPT 4o vs. OpenAI Deep Research

ChatGPT 4o was supposedly given a creative writing upgrade. I put it to the test, comparing it with OpenAI’s highest-end model, “Deep Research.” Both were asked for Haikus about Jakob Nielsen’s usability heuristic #6: Recognition Rather Than Recall.


(See article about my 10 usability heuristics for the full list.)


Comparing the creative writing skills of GPT 4o and OpenAI Deep Research (Midjourney)


Here are the 3 best Haikus from each AI model:


ChatGPT 4o

 

Click, don’t memorize —

paths unfold like open doors,

guiding eyes with ease.

 

Options displayed clear —

Ease the mind's heavy burden,

No need to recall.

 

Icons, prompts, and cues —

cognition rests, eyes wander,

effort fades to none.

 

 

OpenAI Deep Research

 

Less to memorize,

design displays needed clues —

memory at rest.

 

Mind holds seven chunks;

design offloads what it can —

burdens melt away.

 

Hear a once-loved song,

forgotten lyrics return;

silence yields no rhyme.

 

Recognition Rather Than Recall. (Leonardo)


Which AI model wrote the best Haikus? Despite supposedly being a nerd, Deep Research produced some quite poetic interpretations, based on its deeper (ha!) understanding of the heuristic. In contrast, ChatGPT 4o seems too literal.


When I posted these Haikus to my LinkedIn followers, Patrycja Olesiejuk tried the interesting experiment of asking a third AI (DeepSeek R1) to score the Haikus. DeepSeek preferred ChatGPT 4o — to a great extent because it felt that the third of Deep Research’s Haikus strayed from directly addressing recognition over recall. DeepSeek also felt that 4o had more evocative phrasing and metaphors. (Whereas I dinged 4o for its overly literal recital of icons, prompts, and cues — GUI elements I would include in a lecture or article about my heuristic but which seem too heavy-handed for poetry.)


It's a debate, and there is no way to measure what constitutes the best poetry, other than asking people to score the contestants and compute average ratings. However, despite DeepSeek’s analysis, I stand by my (subjective) assessment that Deep Research wrote better Haikus in this little experiment. Taste and judgment are some of the few skills where humans still outrank AI, and my judgment is that Deep Research’s Haiku about the once-forgotten song is in fact a good poem about my own heuristic, even though it doesn’t involve UX design. If I were still a professor teaching my heuristics, this is a Haiku I could see myself showing in class to spark a discussion. And making people think is a good thing for creative writing (even if it’s famously bad for UI design).


I realize I’m deviating from the Haiku tradition by making them in English, but the concise format is irresistible to me. (Midjourney)


By the way, I misspoke in my video about “Service as Software” when I said, “for twenty bucks, an expert team appears” [to become your team of consultants about anything]. After experiencing OpenAI’s Deep Research, I now recommend spending $200/month on this service. That’s still nothing compared to what you might spend on consultants, but the difference is that you get the report in 8 minutes instead of 8 weeks — meaning that it’ll do much more good for your project.


That video was posted December 18, 2024, so only slightly more than two months ago, and my budget recommendation has already changed by 10x. That’s how fast AI changes. (All the other points in the video remain true, though: I still believe most services will be provided via AI very soon. Therapy and radiology are simply the two examples for which I had papers to report this week.)


Grok 3 Launched

Grok 3 launched a week ago. According to some benchmarks, it was indeed “the smartest AI on Earth,” as promised in the pre-launch hype. Even so, I was a little disappointed because I had hoped for a bigger leap ahead. Grok 3 is only a little better than the best of the released OpenAI models and possibly a little worse than OpenAI’s unreleased o3 model (though we don’t know, because o3 is not available to the public and thus can’t be measured on the AI leaderboard, which is based on user ratings).


As of February 20, Grok 3 tops the leaderboard with an Elo score of 1403. In contrast, Google Gemini 2.0 Flash Thinking, ChatGPT 4o, and DeepSeek R1 have Elos of 1385, 1377, and 1362, respectively. Grok’s 18-point lead over the silver-medal AI corresponds to an expected win rate of roughly 53% in the bot-vs-bot comparisons on which the leaderboard is built, or about 5% better than even odds.
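As a sanity check on that conversion (my own back-of-the-envelope arithmetic, not a number published by the leaderboard), the standard Elo expected-score formula turns an 18-point gap into a win probability of about 52.6%:

```python
def expected_win_rate(elo_diff: float) -> float:
    """Standard Elo expected score: probability that the higher-rated model wins a pairwise vote."""
    return 1 / (1 + 10 ** (-elo_diff / 400))

print(round(expected_win_rate(1403 - 1385), 3))  # 0.526, i.e., about 5% better than even odds
```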


Grok 3 includes an image-creation mode. Here, I asked it to draw a picture of the release of the new Grok 3 AI product. I don’t think Midjourney has anything to fear for now. (Grok 3)


However, leaving aside my hopes, Grok 3’s improvement over Grok 2 is actually exactly what one should have expected. The launch event said that Grok 3 was trained on 10x the compute of Grok 2. While that sounds like a lot of AI training, the AI Scaling Laws say that a model needs 100x the training compute to level up a full generation’s worth of added intelligence. AI scaling is logarithmic: moving up from GPT-4 (the common AI level of 2024) to the next generation requires 100x. Moving on to the AGI level expected by 2027 requires 100x100 = 10,000x. And finally, achieving superintelligence by 2030 will require 100x100x100 = 1Mx the compute used to train GPT-4. (Some of this added compute demand can be substituted by algorithmic efficiency, as we saw with the revolutionary Chinese model DeepSeek R1.)
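The arithmetic behind that rule of thumb is easy to restate. A minimal sketch, assuming the 100x-per-generation heuristic cited above (it is a rough empirical regularity, not a precise law):

```python
import math

COMPUTE_PER_GENERATION = 100  # rough "100x compute per AI generation" heuristic

def generations_gained(compute_multiplier: float) -> float:
    """How many full generations a given compute multiplier buys, on a log scale."""
    return math.log(compute_multiplier, COMPUTE_PER_GENERATION)

print(generations_gained(10))         # ≈ 0.5: Grok 2 to Grok 3 (half a generation)
print(generations_gained(10_000))     # ≈ 2.0: GPT-4 level to the AGI level expected by 2027
print(generations_gained(1_000_000))  # ≈ 3.0: GPT-4 level to superintelligence by 2030
```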


In short, giving Grok 3 10x the training compute of Grok 2 absolutely made it much smarter, but it didn’t result in a generational leap in AI intelligence, and we shouldn’t have expected one.


Grok 3 was trained on xAI’s supercluster of 100,000 Hopper GPUs. The launch event revealed that they have already doubled the supercluster to 200K GPUs and are currently working on making it another factor of 5 bigger, which will result in a training cluster of 1M GPUs. This will be enough to take Grok 3.5 or 4 (whatever they’ll call their next model) to that long-desired next level.


Even though Grok 3’s smartness level didn’t blow me away, the speed with which Grok has improved is nothing short of astounding. Grok 1 was released in November 2023, Grok 2 came in August 2024 (9 months later), and Grok 3 followed only 6 months after that. The cadence keeps getting faster, and even though each release was probably only half a generation better than the previous model, the step from Grok 1 to Grok 3 is certainly equivalent to a full generation of AI scaling — something that usually takes two years but which xAI achieved in 15 months. That means xAI is moving at roughly 60% higher speed than the competition, which allowed them to catch up despite starting late. It even looks like their speed is accelerating, so Grok 4 could be a marvel — if they can keep accelerating.


The secret weapon of xAI is two-fold:


  1. AI itself has no secret sauce. Once something has been invented, other labs get it fast. xAI said it implemented several innovative tricks from DeepSeek R1, which was released only 4 weeks before Grok 3.

  2. Execution speed rules. This is where xAI has its true advantage, by being allied with Tesla and SpaceX — companies that probably have the world’s best hardware and manufacturing engineers outside China. xAI can build superclusters faster than any non-Chinese company, so as long as China remains behind in GPU chips, xAI can outrace any other AI firm.


Grok is on growth hormones: it grew fast from release 2 to release 3. (Leonardo)


Grok 3 has a usability problem in the confusing names used for its advanced features: DeepSearch, Think, and Big Brain. Supposedly, DeepSearch relies more on finding fresh information from the Internet, whereas Think and Big Brain deploy progressively more inference-time reasoning compute. These distinctions will likely prove too subtle for average users, and the naming scheme doesn’t help much.


I used Grok 3 for most of the lyrics in my music video about this week’s newsletter (YouTube, 3 min.), though I also used one verse from Gemini. I was impressed with the quality of Grok’s lyrics compared with those from Gemini and ChatGPT 4o (from which I also requested lyrics but didn’t use any).

I gave Grok the same challenge as I posed to OpenAI’s 4o and Deep Research a few days ago (previous news item): write Haikus about Jakob Nielsen’s usability heuristic number 6, “Recognition Rather than Recall.”


First, here are the Haikus I got from Grok 2, just before the upgrade:

 

Icons speak in sight,

Words need not be held in mind,

Clarity in view.

 

Choices laid before,

No digging through memory,

Intuitive flow.

 

Buttons clear and bright,

Functions known without recall,

Simplify the task.

 

I don’t think Grok 2’s Haikus are as good as the ones from 4o and Deep Research: they are pretty pedestrian and literal. This is to be expected, since Grok 2 was widely acknowledged to be the weakest of the big AI models; the main impressive thing about it was how quickly it was built.


It’s more interesting to see what Grok 3 came up with:

 

Familiar shapes glow,

Mind finds peace in what it knows,

Recall fades away. 

 

Buttons sing their names,

Memory rests, unburdened,

Choice blooms on the screen. 

 

Mistakes gently caught,

Suggestions light the way home,

No need to remember.

 

I do think Grok 3 did better than Grok 2. Better than OpenAI Deep Research (previous news item)? Maybe, but that’s a more even match. What do you think?


Grok improved its Haiku-composition skills in release 3. (Leonardo)
