Summary: AI is a general-purpose technology that can double economic growth | A study compares human card sorting with categorization produced by ChatGPT | An LLM is a super-aggregated mental model across all Internet users | ChatGPT is less lazy after tech update | A new song about ADPList’s live UX session
UX Roundup for February 5, 2024. (Midjourney)
AI to Double Economic Growth in the 2020s
The Financial Times has an interesting interview with Erik Brynjolfsson, who is the world’s leading expert on productivity research. Brynjolfsson conducted some of the research I have already reviewed on the productivity impact of AI. (Summary across many case studies: using current AI improves the productivity of knowledge workers by around 40% for those tasks where they can use AI.)
The key qualifier in the above summary is that AI only helps with tasks where it can actually be used. Sounds obvious, but there are still many knowledge tasks that are less suited for current AI.
This is where the FT interview becomes interesting. Brynjolfsson’s assessment is that for the United States as a whole, AI use will double economic growth — or maybe more than double — for the rest of this decade, as AI rolls out in more and more companies.
(This is for the USA. Since the EU has harsher regulations with their new AI Act, it’s likely that they won’t benefit from as extensive economic gains as the US. On the other hand, since Europe can import AI tools from both the US and Asia, they’ll still do well, even with less homegrown AI. The biggest economic gains will come from AI deployments, not from inventing or building AI.)
Brynjolfsson compares AI with electricity as an invention with economic impact. He mentions that the biggest gains from electricity didn’t come until 30-40 years after it was introduced in factories, though he thinks that AI will spread faster. Still, the point remains that even though it’s obvious that AI will change the way all companies work, it will take some time to work through organizational inertia. (In my overview of the time scales of UX, organizational change scored as the second-slowest, behind only societal change.) Even small consulting companies that one might think would be agile can take years to pivot from the old world to the new, for example, by adjusting staffing.
Most fundamentally, AI is a general-purpose technology (like electricity). It’s not AI itself that’s beneficial, it’s the way it accelerates other projects. Such general-purpose technologies have 3 characteristics, according to Brynjolfsson:
Pervasive: they affect most industry sectors and projects.
Rapid: they improve fast (as we certainly saw in 2023 with the improvements from GPT-3.5 to GPT-4, Midjourney v.5 to v.6, and huge advances in video and music generation). ChatGPT itself went from zero to 100 million users in two months, which is faster than fast.
Complementary: they spawn additional innovations that have synergy with the basic technology.
This last point is still somewhat lacking with AI: we have lots of examples of AI use that improve existing work processes, but not as many where AI has spawned new ways of doing things, with the associated complementary innovations. I expect this to come soon, possibly already in 2024, and this is one of the areas where UX can (and should) lead the way, with our methods for task analysis, service design, and so forth.
Electricity and AI are both general-purpose technologies that affect many industry sectors and spawn complementary new technologies, changing the way work is done. (Dall-E)
Finally, Brynjolfsson warned against what he called the “J-curve,” which is where productivity may paradoxically decline a little in the beginning before it takes off on an almost vertical climb. The reason for this dip is that companies initially need to invest in the new technology and learn how to apply it fruitfully in their specific business. This time and expense may initially be greater than the gains. Only later, once the company has discovered the optimal use of the new technology, will superior gains be harvested.
This is certainly something to watch out for, but basic AI use is extraordinarily cheap (say, $20/month for ChatGPT Plus and another $20/month for Midjourney, per employee), and it leads to almost immediate gains because basic use within employees’ current workflow is so easy to implement. This favorable ROI for the initial use of AI is likely to overshadow the investments in discovering and developing more fundamental uses of AI in many companies. So I think that a hockey-stick curve is more likely than a J-curve for AI profitability.
Many more interesting topics are covered in the interview, so I recommend that you read it all.
AI Replicates Card Sorting Findings
MeasuringU is the world’s leading authority on quantitative user research. They conducted an experiment comparing a card sorting study with human participants against the use of AI to categorize the same items.
(Card sorting is a research method used in Information Architecture (IA) to help design or evaluate the structure of a website or application. Study participants sort topics into categories that make sense to them, which helps designers understand the users’ mental models and how they think the information should be organized.)
For their case study, MeasuringU used a set of 40 items from the website of consumer-electronics retail giant Best Buy, which were sorted by 200 human test participants. The names of the same 40 items were then fed to ChatGPT-4, which was asked to group them into categories. Because AI is non-deterministic, the authors performed 3 runs with ChatGPT, using slightly different prompts.
The main finding from this study is that humans and AI both produced 5 categories as the suggested information architecture. There was extremely high agreement in both the suggested names for these categories and in the assignment of individual items to specific categories.
Across the 3 AI runs, the overlap with the human categorization of the items varied from 63% to 77%.
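(MeasuringU’s exact formula isn’t reproduced here, so treat the following as a sketch only: one common, label-independent way to quantify how much two card sorts agree is to count the share of item pairs that both sorts treat the same way, either grouped together or kept apart. Item names and categories below are made up.)

```python
from itertools import combinations

def pairwise_agreement(sort_a: dict[str, str], sort_b: dict[str, str]) -> float:
    """Each argument maps item name -> category label; labels need not match across sorts."""
    items = sorted(set(sort_a) & set(sort_b))
    pairs = list(combinations(items, 2))
    # A pair "agrees" if both sorts group the two items together, or both keep them apart
    agree = sum(
        (sort_a[x] == sort_a[y]) == (sort_b[x] == sort_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)

# Toy example with hypothetical items and categories
human = {"fridge": "Appliances", "blender": "Appliances", "laptop": "Computers", "mouse": "Computers"}
ai    = {"fridge": "Home",       "blender": "Computing",  "laptop": "Computing", "mouse": "Computing"}
print(pairwise_agreement(human, ai))  # 0.5 in this toy example
```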
I would have been interested in seeing an analysis of a combination of the 3 runs, which, after all, is the equivalent of analyzing the human card sorting results as the combination of the contributions from each of the 200 users. In fact, since AI runs are virtually free, one could have done 200 runs with ChatGPT and combined them into one master classification. Whether this would produce better results remains to be seen, but I think it is likely.
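As a sketch of what such a combination could look like (with hypothetical item and category names, and without claiming this is how MeasuringU would do it), many AI runs can be merged the same way human card sorts are usually combined: count how often each pair of items lands in the same category across runs, then cluster on that similarity matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def merge_runs(runs: list[dict[str, str]], n_categories: int) -> dict[str, int]:
    """Each run maps item name -> category label; labels need not match across runs."""
    items = sorted(runs[0])
    n = len(items)
    together = np.zeros((n, n))
    for run in runs:
        for i, a in enumerate(items):
            for j, b in enumerate(items):
                if run[a] == run[b]:
                    together[i, j] += 1
    distance = 1.0 - together / len(runs)   # items that are often co-grouped are "close"
    np.fill_diagonal(distance, 0.0)
    tree = linkage(squareform(distance, checks=False), method="average")
    labels = fcluster(tree, t=n_categories, criterion="maxclust")
    return dict(zip(items, labels))

# Three hypothetical AI runs over four items, merged into two master categories
runs = [
    {"fridge": "Kitchen", "blender": "Kitchen", "laptop": "Tech", "mouse": "Tech"},
    {"fridge": "Home", "blender": "Home", "laptop": "Computers", "mouse": "Computers"},
    {"fridge": "Appliances", "blender": "Kitchen", "laptop": "Computers", "mouse": "Computers"},
]
print(merge_runs(runs, n_categories=2))
```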
For now, we just know the overlap between a combination of 200 humans on the one hand and a single ChatGPT classification attempt on the other. In this case study, this overlap is very high.
This finding doesn’t necessarily show that one can eliminate human test participants from information-architecture design projects. However, it does suggest that AI can be used for an initial classification that could then possibly be checked with a closed card sorting study with the target audience.
There are some weaknesses in this study. But all initial research into a new area has weaknesses, because nobody can study everything in a first attempt. Any data is better than none, and I am grateful to MeasuringU for providing empirical data, as opposed to the speculation we see in most other commentary on AI.
The first weakness is that the study measures the overlap in item assignments rather than the usability of the resulting information architecture. Whether you employ human participants or use AI, the goal of card sorting is not to sort data, but to get inspiration for an IA that maximizes users’ ability to quickly find what they are looking for. Sometimes, it’s better to move items around from where they were placed by the majority of users. Sometimes, a good UX writer will create better labels than those suggested by the users (who are not writers, after all).
Sadly, most researchers today perform card sorting studies with online tools and only pay attention to the statistical findings instead of listening to thinking-aloud recordings of users explaining why they sorted a certain item into a certain category. This qualitative data is often the most valuable part of a card sorting study for creating a maximum-usability IA. Sad that it’s being overlooked. But in principle, this qual data can be had when testing with real users, whereas you don’t get it from ChatGPT. (Of course, you can ask ChatGPT why it placed a certain card in whatever category it used, but AI is notoriously poor at explaining itself, and in any case, ChatGPT’s explanation will not reflect how your customers think of your product line.)
A second weakness in the study is that it dealt a bad hand to ChatGPT by only providing it with the names of each of the 40 products. A fair comparison between human card sorting and AI-fueled categorization actually requires providing different information for the two conditions. For human card sorting, you need each card to be concise and only contain minimum information, or users won’t be able to get an overview of the cards and complete the sorting exercise. But AI is under no such constraints, and for it to do its best job, it should be fed not just the product names but the full writeup of each item from the website.
My guess is that ChatGPT would have produced a better classification if it had been given more information about each item.
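To make this concrete, here is a minimal sketch of what feeding the model full product writeups rather than bare names could look like. The item data and prompt wording are hypothetical; the call uses the standard OpenAI Python SDK chat-completions endpoint.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical items: name plus the full description from the product page
items = {
    "55-inch OLED TV": "Full product description scraped from the site...",
    "Noise-canceling headphones": "Full product description...",
    # ...all 40 items
}

prompt = (
    "Group the following products into categories that would make sense as the "
    "top-level navigation of a retail website. Use the descriptions, not just the names.\n\n"
    + "\n\n".join(f"{name}:\n{desc}" for name, desc in items.items())
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```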
Card sorting, as imagined by Midjourney’s new Niji 6 mode, which creates anime-inspired artwork.
Same prompt in Midjourney, retaining the Japanese office theme, but turning off the anime drawing style. I prefer the metaphorical image created with Niji. The photorealistic image seems more like a messy office than a project to bring structure to a mass of information. (See a real photo of an actual card-sorting setup in my article about usability labs.)
LLM As Super-Aggregated Mental Model
The MeasuringU card sorting article I discussed above contained a throwaway remark that I find to be deeply profound and to possibly have implications far beyond card sorting and information architecture. The authors (Jeff Sauro, Will Schiavone, and Jim Lewis) say that large language models like ChatGPT can be considered “a type of mental model based on enormous amounts of human-generated text.”
After all, all current LLMs are constructed by reading the entire Internet, meaning that they vacuum up and aggregate all the different ways millions of humans have thought about millions of different things. In this way, the AI becomes a kind of averaged brain with an averaged mental model of how humans in general think about things.
This generalized mental model may work quite well in suggesting ways to think about common things, like the consumer products sold by Best Buy. Not everybody will classify, say, a refrigerator in the same way, but many people probably think roughly the same about refrigerators. Most people will probably say that it’s common (in rich countries) to have a refrigerator in the kitchen. Some may also think that such appliances would fit in a garage or a man cave, so there would not be full agreement among the Internet’s millions of comments about refrigerators. Also, different people store different things in their refrigerators. But a rough consensus about the main uses of fridges is likely, and any LLM would embody this main conclusion, much as it could also rattle off a set of alternative uses of refrigerators.
A key benefit of using ChatGPT for UX design inspiration (whether drafting UX writing or IA categories) might be that its first proposal is likely to roughly reflect any broad consensus among the masses. In contrast, anything you might create yourself will reflect internal thinking and specialized expertise beyond the ken of normal people.
Mental models are the assumptions and perceptions users have about how a system should work, based on their previous experiences and interactions. These models guide users’ expectations and behaviors when interacting with new interfaces, influencing their usability and the overall user experience. (Midjourney)
ChatGPT Is Less Lazy After Tech Update
OpenAI has launched a tech update to GPT-4 in which they claim to have addressed the laziness problem that has plagued ChatGPT for the last several months. This laziness manifests itself as poor performance where the AI only does part of the requested work. Particularly for lengthy operations, like summarizing a long article, it had a tendency to stop before the end, annoyingly forcing the user to issue a follow-up prompt like, “please keep going.”
I commented on the laziness of ChatGPT in my December 18, 2023 newsletter, so I had definitely noticed this problem myself. Since the update, I have noticed its performance getting better. Bravo, OpenAI.
ChatGPT is no longer quite as lazy after the last tech update. (Midjourney)
Live on ADPList February 22 (ft. New Song)
I will have a live broadcast event with superstar designer Sarah Gibbons on February 22, hosted by ADPList. I used Suno AI to make a song to announce this event: listen on Instagram (52 secs.).
Advance registration is required for this session (free). We have about 5,000 UX fanatics registered already, but there are still seats available.
8,000 people from around the world will watch live when usability pioneer meets design ace on ADPList on February 22. (Midjourney, using the new Niji 6 mode)
The video from my earlier fireside chat with ADPList is available (78 min.).