Summary: AI is reshaping the landscape of creativity and innovation. AI outperforms humans in both general ideation and the production of academic research ideas, but people still discriminate unfairly against AI-generated content and ideas.
A new research study by Noah Bohren and colleagues from the University of Lausanne in Switzerland confirms many previous studies: AI exhibits high creativity, surpassing human creativity. (See my overview of 12 AI creativity studies.)
I mention this new research because it’s always nice when new research confirms old research. Research findings become more credible the more often scientists from different institutions in different countries discover roughly the same thing, despite using different methods and study designs.
New research keeps finding that AI has high creativity. It’s a great ideation tool. (Leonardo)
This study compared 1,250 humans with two AI systems: ChatGPT-4 and Google’s Bard. The humans and the AIs were both asked to perform various creativity tasks, such as “describe a town, city, or society in the future.” A further set of 3,336 humans were then asked to rate the degree to which the ideas were novel, surprising, and useful.
(Bard did poorly in this study, but it was an early AI system that Google released in a panic after being taken by surprise by the success of ChatGPT. Google has since released a much better AI model, so I won’t discuss the Bard findings here.)
As a further twist, some of the humans were ideating on their own and some used an AI system in a co-creation session.
The average creativity ratings were as follows, on a 1-10 scale:
Humans (without AI help): 5.35
Humans co-creating with ChatGPT-4: 5.84
ChatGPT-4: 7.24
Thus, ChatGPT-4 generated ideas with dramatically higher creativity scores than the humans. (Differences significant at p < 0.01)
AI gives us creativity on demand. Like a vending machine. (Midjourney)
Co-Creation Fell Short in This Study
Human performance did improve when people were using AI in a co-creation scenario, but it did not reach the level of AI creating alone. This is a bit surprising, because we would normally expect co-creation to be better than either party working alone. One potential explanation is that the human participants were recruited from an online panel, rather than from creative professions. Thus, the majority of human participants were probably low-creativity individuals, which may be why they dragged the AI down.
The paper includes one result to support this interpretation: the participants were scored for the creativity component of their jobs on a 1-10 scale. For each level on this scale, ideas were rated 0.036 points higher in creativity. People with the most creative jobs (level 10) are 9 levels above people with the least creative jobs (level 1), so people in the high-creativity jobs generated ideas rated about 0.32 creativity points (9 × 0.036) higher than people in the low-creativity jobs. This is a much smaller difference than I would have expected.
Even looking only at the most creative ideas doesn’t change the base finding much. Of the top 5% of ideas in terms of creativity ratings, 59% were from ChatGPT, 21% were from humans working alone, and 20% were from human-AI co-creation.
Usually, human-AI co-creation is better, though not in this particular study. (Ideogram)
Impact of Competing With AI: Men vs. Women
Usually, sex differences don’t matter to user interface design. Yes, men and women are sometimes interested in different content, as anybody can see by comparing a men’s magazine with a women’s magazine. But in terms of the user interfaces and interaction techniques, sex differences are so minute that they don’t matter for practical design projects.
If you measure hundreds of users with great precision, you can sometimes find differences on the order of 0.1% in how efficiently men vs. women use a particular design element. But such small differences pale against the several hundred percent performance difference between good and bad design. So, for example, if you optimized for women and completely disregarded male participants in your user research, you might indeed penalize men by 0.1%. But optimizing for women would still gain those men 200% or more, relative to a design that hadn’t been optimized for any humans. So even in this extreme case, the men would still benefit.
The present study is one of the few that found a bigger sex difference.
The study included a condition where the people generating ideas were told that their ideas would be competing with ideas generated by AI. The actual creativity task was the same in both conditions, but some participants knew that they were competing with AI, whereas other participants didn’t know this.
Here’s the effect of knowing that you are competing with an AI:
Men: +0.08 points in the rated creativity of their ideas
Women: -0.15 points in the rated creativity of their ideas
The stated changes are relative to the rated creativity of male or female participants who had not been told they were competing with AI. (Differences only marginally significant.)
The net difference between men and women is 0.23 points on the 1-10 creativity scale: roughly 4%, if we relate it to average ratings of about 5.5 points and ignore the fact that the creativity ratings are not a ratio scale. This is fairly small, so we should not make much of this finding. However, one possible interpretation is that competing with AI stirred the men’s competitive juices and motivated them to perform a little better. In contrast, maybe the women were either a little intimidated or a little demotivated by competing with AI.
Meatware Bigotry
The University of Lausanne study confirmed previous research in one more way: humans currently have a strong bias against AI.
The study included a condition where the independent human raters were told that the ideas they were rating were generated by either an AI or by a human. They were not told who had produced any given idea, but they were made aware that some of the ideas came from AI. For each idea, raters were also asked to estimate whether it came from AI or from a human.
The results show that ideas deemed to be AI-generated were rated lower than ideas deemed to be human-generated. On average, ideas guessed to be from AI received creativity scores 0.12 points lower (on the 1-10 creativity scale) than the identical ideas received from raters who were not aware of the source of the ideas. (Difference significant at p < 0.05)
While not a huge difference, this finding does show that humans currently discriminate against AI and judge its work as poorer, purely because it’s an AI. Or in this case, purely because they thought that an idea came from AI. As it turns out, people are fairly bad at assessing whether something was written by an AI or a human. Ideas that were indeed generated by ChatGPT-4 were correctly identified as AI in only 61% of cases (whereas 39% of AI ideas were deemed to be human). Similarly, only 63% of human-generated ideas were deemed to be human (and 37% were guessed to be AI ideas).
To simplify: Suppose a reviewer was given 6 proposals, 3 from AI and 3 from humans. Then this reviewer would correctly classify 2 of the AI proposals and 2 of the human proposals, but misclassify the remaining 2 proposals.
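To make this concrete, here is a minimal sketch (my own illustration, not from the paper) of how the reported identification rates translate into that 6-proposal example:

```python
# Hypothetical illustration: apply the study's reported hit rates
# (61% for AI ideas, 63% for human ideas) to 3 AI and 3 human proposals.
ai_proposals, human_proposals = 3, 3
p_ai_identified, p_human_identified = 0.61, 0.63  # rates reported in the study

correct_ai = round(ai_proposals * p_ai_identified)           # 2 of 3 AI proposals
correct_human = round(human_proposals * p_human_identified)  # 2 of 3 human proposals
misclassified = (ai_proposals - correct_ai) + (human_proposals - correct_human)

print(correct_ai, correct_human, misclassified)  # 2 2 2
```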
This new finding echoes the outcome of a recent study of empathy, where participants rated dialogs with AI as showing more empathy than dialogs with a human. But the empathy scores dropped when participants were told that they were chatting with an AI (whether truthfully, or as a deception when they were actually chatting with a fellow human).
Two different studies of two different domains: creativity and empathy. Same findings:
AI scores higher than humans on both creativity and empathy
The scores go down when participants believe they are dealing with AI
It’s sad that people suffer from this meatware bigotry. We are evolutionary creatures with a strong tendency to prefer our own kind and detest strangers. This nasty habit is probably why we killed off all the Neanderthals rather than living with them in harmony. From the perspective of evolution in nature, this makes sense: for me to survive, you must die. In nature, resources are scarce, so if Neanderthals (or AI) do well, there will be less to eat for me and the generations who should propel my DNA through time.
But cave-age evolutionary logic shouldn’t persist in the modern age, which is no longer a zero-sum game. We grow the economy by collaborating and inventing new things. AI is a great help to humanity, so we should not discriminate against it.
I hope that as humanity gains more experience with AI, we will stop viewing it so negatively.
I am particularly hopeful that increased personal exposure to AI will turn people more positive. There’s a small amount of research suggesting that personal AI use does improve attitudes toward AI. Right now, many people still don’t use AI, so all they know about it is media scaremongering. Even worse, people may have tried an underpowered, obsolete AI such as ChatGPT 3.5 (or Bard, which scored horribly in the present study). Using bad AI certainly gives people a bad experience, and if they give up on experimenting with later models, they will be left with an inaccurate impression of AI.
I think many people are still worried about being chased out of their house by smart toasters and other AI-enabled appliances. Science-fiction movies haven’t helped with AI’s image problem. (Ideogram)
AI Beats Scientists in Research Creativity
OK, so AI beats regular humans on simple creativity tasks. But what about high-end creativity? Glad you asked. Chenglei Si and colleagues from Stanford University addressed this question in another study published last month.
This project studied AI’s ability to generate ideas for cutting-edge research, comparing it with human researchers. They chose the academic field of Natural Language Processing (NLP).
The AI used in this project was Claude Sonnet 3.5, which was a cutting-edge frontier model as of mid-2024. (The AI scaling law predicts even better performance when AI moves to the next generation of capability in late 2024 or early 2025.) Claude 3.5’s native capabilities were supplemented with RAG (retrieval-augmented generation) where the AI model was fed 120 recent research papers in this domain. This step ensured that the AI would generate new research ideas on top of the existing research in the NLP domain.
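For readers who want a feel for what such a setup involves, here is a minimal sketch of a retrieval-augmented prompt using a simple TF-IDF retriever. This is my own illustration under assumed details (the `papers` list, the `build_ideation_prompt` function, and the prompt wording are hypothetical), not the authors’ actual pipeline:

```python
# Illustrative RAG-style prompt building (hypothetical, not the paper's code):
# rank recent papers by relevance to a topic and prepend the best matches
# to an idea-generation prompt for the language model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_ideation_prompt(topic: str, papers: list[dict], k: int = 10) -> str:
    # `papers` is assumed to be a list of {"title": ..., "abstract": ...} dicts.
    corpus = [p["title"] + " " + p["abstract"] for p in papers]
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_vectors = vectorizer.fit_transform(corpus)
    topic_vector = vectorizer.transform([topic])

    # Keep the k papers most similar to the topic.
    scores = cosine_similarity(topic_vector, doc_vectors).ravel()
    top = scores.argsort()[::-1][:k]
    context = "\n".join(f"- {papers[i]['title']}: {papers[i]['abstract']}" for i in top)

    return (
        "You are an NLP researcher. Here is recent related work:\n"
        f"{context}\n\n"
        f"Propose a novel, feasible research idea on the topic: {topic}"
    )
```

The resulting prompt would then be sent to the model; the study’s actual pipeline was, of course, more elaborate than this sketch.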
To compare with the AI’s ability to generate good research ideas, the study recruited 49 human researchers in the field. I think it’s fair to classify these human researchers as junior but accomplished scientists. On average, they had 12 papers listed in Google Scholar, which had received a mean of 477 citations from other scientists, and their mean h-index was 5.
The h-index is a measure of scientific impact: an author’s h-index is the largest number h such that h of their papers have each received at least h citations. For comparison, my publications have 130,839 citations in Google Scholar, and my h-index is 122. Ben Shneiderman, probably the most published scientist in the user interface field, beats me with an h-index of 137, though his citation count is “only” 118,548. For the sake of the experiment, I looked up the h-index of the 14 full professors at Carnegie Mellon University’s Human-Computer Interaction Institute, which is probably the world’s leading academic center in that field. These truly senior scientists have a mean h-index of 79.
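For the curious, here is a minimal sketch of the standard h-index computation (my own illustration, not taken from either study):

```python
# Compute an h-index: the largest h such that the author has h papers
# with at least h citations each.
def h_index(citations: list[int]) -> int:
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # at least `rank` papers have >= `rank` citations
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have at least 4 citations each
```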
To judge the research ideas, the study recruited a further 79 human scientists from the NLP field. These reviewers were slightly more senior, with a mean number of 15 papers, 635 citations, and an h-index of 7. These are still numbers that are more characteristic of an assistant professor than a senior researcher, but the reviewers were certainly sufficiently accomplished to provide qualified assessments of the strengths of a research proposal. (In fact, almost all of these reviewers had served as referees for major AI publications.)
In total, the 49 human researchers produced 119 ideas for new research projects, whereas Claude Sonnet 3.5 produced 109 ideas.
The reviewers rated these new research ideas on a 1-10 scale for novelty, excitement, feasibility, and expected effectiveness. AI beat the human scientists in 3 of the 4 proposal qualities, though for the one parameter where humans outscored AI, the difference was not statistically significant, so we can’t say whether humans were really better or just benefited from a random fluke.
Novelty: rated 4.8 for humans and 5.6 for AI (p<0.01)
Excitement: rated 4.6 for humans and 5.2 for AI (p<0.05)
Feasibility: rated 6.6 for humans and 6.3 for AI (not significant)
Expected effectiveness: rated 5.1 for humans and 5.5 for AI (not significant)
Bottom line: current AI is more creative than junior human scientists when it comes to producing ideas for new research studies.
AI is better than human scientists at getting ideas for new things to test in a research study. (Midjourney)
The authors’ qualitative analysis of the research ideas found that the human ideas were generally more grounded in existing research and practical considerations, and that the humans prioritized the feasibility and effectiveness of actually carrying out the proposed research. In contrast, the AI was less constrained, which likely led to its higher novelty and excitement scores.
AI Is Creative: Deal With It
Machines dream. Humans judge. Numbers speak. Bias persists. Creativity evolves. Collaboration emerges. Future beckons. Adapt or fade.
Podcast
As an example of AI creativity, I asked NotebookLM to create a podcast where its two hosts discuss this article plus 5 earlier articles I have written about AI creativity. This was an experiment to see whether the AI could make sense of a range of source materials and integrate them well.
Here’s the podcast: https://youtu.be/ap2ij1fxR9E (YouTube, 13 min. video). I think it did a good job at bringing together points from across my articles into a coherent story.
Only the soundtrack of the video was generated with NotebookLM. I created the visuals myself, using a base image of two podcast hosts made with Midjourney, which I animated with Kling using image-to-video.
References
Noah Bohren, Rustamdjan Hakimov, and Rafael Lalive: “Creative and Strategic Capabilities of Generative AI: Evidence from Large-Scale Experiments.” IZA – Institute of Labor Economics discussion paper IZA DP No. 17302, September 2024. (40-page PDF file.)
Chenglei Si, Diyi Yang, and Tatsunori Hashimoto: “Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers,” arXiv:2409.04109, September 2024. (94-page PDF file.)