Summary: In a controlled experiment, consultants at a top-3 consulting firm were 33% more productive and produced 40% higher-quality work with GPT-4 than colleagues working without AI. Even among these elite knowledge workers, AI narrowed the difference between high and low performers.
Existing research on the use of AI by knowledge workers has consistently demonstrated 3 findings:
Higher productivity. Knowledge workers produce results faster when aided by AI. Across the early case studies, people produced 66% more results per hour when using AI than when not using AI.
Better quality. Despite producing results in less time with AI, the quality of the work products was rated as better when people used AI than when they worked without AI assistance.
Narrowing skills gaps. AI improves performance metrics for both high-performing and low-performing staff, but it helps the low-performing staff the most, narrowing the gap between the two groups.
Given that high performers gain the least from AI, it would be interesting to see whether AI can help extremely high-performing individuals. A new study by Fabrizio Dell’Acqua and 8 colleagues from the Harvard Business School and elsewhere gives us the answer: “yes.”
The new research studied strategy consultants at the Boston Consulting Group (BCG), often considered the world’s second-most prestigious management consultancy. It has been included in every list I have seen of the world’s three most prestigious management consulting companies.
Though I don’t have access to IQ scores for the BCG consultants, they are likely in the top 0.5% of the population, with even the least brilliant of their consultants probably still in the top 1%. This is an elite group of high performers.
Even elite management consultants at the pinnacle of financial success see performance gains when they harness AI. (Rich consultant + money, by Midjourney.)
Research Method
The researchers tested 758 BCG consultants. Using stratified sampling, the researchers randomly assigned participants to one of 3 conditions:
A control group performing tasks without artificial intelligence tools.
A GPT-4 group performing the tasks with the use of GPT-4 (the current best version of ChatGPT).
A GPT-4-training group that used the same AI tool as the second group but also received a small amount of training in current best practices for prompt engineering.
Stratification is a way to create similar participant groups according to various criteria that would otherwise generate noise in the data. Participants are still assigned to conditions randomly, but the researchers aim to have equally many participants in each condition with similar scores on the stratification criteria. This study’s stratification criteria included gender, location, tenure at BCG, personality test scores for openness to innovation, and native-English speaking status.
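To make the procedure concrete, here is a minimal Python sketch of stratified random assignment. This is my illustration, not the researchers’ code: the function name, the criteria fields, and the round-robin dealing are all hypothetical simplifications of the general technique.

```python
import random
from collections import defaultdict

def stratified_assign(participants, conditions, key):
    """Randomly assign participants to conditions, balancing each stratum.

    `participants` is a list of dicts; `key` extracts the stratification
    profile (e.g., gender + native-English status) for each person.
    """
    strata = defaultdict(list)
    for p in participants:
        strata[key(p)].append(p)

    assignment = {c: [] for c in conditions}
    for members in strata.values():
        random.shuffle(members)          # randomize order within the stratum
        for i, p in enumerate(members):  # deal round-robin across conditions,
            assignment[conditions[i % len(conditions)]].append(p)
            # keeping per-stratum counts nearly equal in every condition
    return assignment

# Hypothetical usage with the study's three conditions:
conditions = ["control", "gpt4", "gpt4_trained"]
people = [{"id": n, "gender": random.choice("MF"),
           "native_english": random.choice([True, False])}
          for n in range(758)]
groups = stratified_assign(people, conditions,
                           key=lambda p: (p["gender"], p["native_english"]))
print({c: len(g) for c, g in groups.items()})
```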
All participants first completed an assessment task without AI help. This provided a standardized metric to assess high versus low performers.
The participants then completed tasks according to their assigned experimental condition, with or without AI help. For both the assessment task and the experimental tasks, the quality of each participant’s solution was scored by independent experts on a 1-10 rating scale.
The tasks were designed to be representative of normal strategy consulting. They included assignments like generating ideas for new products in a specific category, defining market segments for this product, creating marketing slogans for the product, suggesting ways to test the effectiveness of the slogan, writing a focus group guide for researching the product with target customers, writing a press release, and synthesizing the lessons from the project into an outline of an article for industry practitioners.
A stereotypical management consultant from the Boston Consulting Group, like the people in the study analyzed in this article. I know it’s a stereotype because I asked Midjourney to draw me “a consultant from the Boston Consulting Group” without additional prompt details. Sadly, according to photos found on the Web, most actual consultants don’t seem to wear neckties these days. But I can uphold standards in my illustrations! My top-hat-wearing emoji is my preferred way to go all in on metaphorical stereotypes without the baggage of literal stereotypes carried by the human-looking consultant. (The benefits of non-literal characters are discussed in Scott McCloud’s classic book Understanding Comics.)
33% Productivity Gain From AI
This study does not provide much productivity data because most of the test tasks were conducted within fixed time limits, where participants were not allowed to proceed to the next task until the time was up, even if they finished early. This procedure gave participants no incentive to work quickly, as they normally would in real business work where time is money.
The one exception was creating an outline for a lessons-learned article. For this task, the researchers collected time-on-task data as follows:
Control group (no AI): 84 minutes
GPT-4, but no training: 65 minutes
GPT-4 with prompt training: 61 minutes
The difference between the no-AI group and the two AI groups was statistically significant at p<0.01, whereas the difference between the two AI groups was not statistically significant.
Given the lack of a significant difference between the two AI groups, I’ll average them. The data shows that in an 8-hour work day, the AI users could produce 7.6 article outlines, compared with only 5.7 article outlines created by the consultants who worked the old-school way without AI. (Ignore that these elite management consultants probably work more than 8 hours daily. The relative productivity gain will be the same if we assume a 10-hour workday or any other number.)
These numbers show a productivity gain of 33% from using AI.
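For readers who want to check the arithmetic, here is the back-of-the-envelope calculation in a few lines of Python:

```python
# Productivity math from the time-on-task data above.
workday_min = 8 * 60                     # any workday length gives the same ratio
no_ai = 84                               # minutes per outline, control group
ai = (65 + 61) / 2                       # average of the two AI conditions

outlines_no_ai = workday_min / no_ai     # ≈ 5.7 outlines per day
outlines_ai = workday_min / ai           # ≈ 7.6 outlines per day
gain = outlines_ai / outlines_no_ai - 1  # ≈ 0.33, i.e., a 33% gain
print(f"{outlines_no_ai:.1f} vs. {outlines_ai:.1f} outlines/day: +{gain:.0%}")
```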
This is less than the 66% productivity gain that was the average outcome of the previous research. This study’s lower productivity gain is consistent with the general point that AI helps low performers more than it helps high performers, since all BCG consultants are already extremely high performers.
40% Quality Gain From AI
As mentioned, the quality of the work products was rated on a 1–10 scale by a group of independent experts. Averaged across the test tasks, the scores for the 3 conditions were:
Control group (no AI): 4.1
GPT-4, but no training: 5.6
GPT-4 with prompt training: 5.8
The difference between the no-AI and AI-using participants was significant at p<0.01, and the difference between the two AI groups was marginally significant, with p values of 0.029, 0.047, and 0.053 for three different ways of running the regression analysis.
Since the rating scale is not a true ratio scale, we can’t strictly calculate a percentage difference between the scores. But there is no doubt that a lift of 1.5 points or more on a 1–10 scale corresponds to a substantial quality improvement.
If we ignore the statistical caveats and compute a percentage gain anyway, we see that the AI users received 40% higher quality scores than the non-AI users.
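The same quick Python check works here. Note that the rounded means reported above yield 39%; the headline figure of roughly 40% presumably comes from the paper’s unrounded means.

```python
# Quality-gain arithmetic, using the rounded means reported above.
control = 4.1
ai_avg = (5.6 + 5.8) / 2     # 5.7, averaging the two AI conditions
gain = ai_avg / control - 1  # ≈ 0.39, i.e., the roughly 40% headline figure
print(f"AI users scored about {gain:.0%} higher than the control group")
```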
Narrowing Skills Gap
The study replicated previous research once more by also finding a narrowing of the skills gap between top and bottom consultants when using AI. This is notable because even the “bottom” consultants are still at the very high end of the human population. Thus, even the difference between great and exceptionally great performers narrows when using AI.
AI builds skills (Ideogram).
Participants who scored in the bottom half of the initial assessment task received a mean score of 5.79 on the experimental tasks. In contrast, participants who were assessed in the top half received a mean score of 6.09. So far, the top performers got higher scores than the bottom performers. No surprise.
What’s interesting is to see how much of a lift people in the two groups received from their use of AI: bottom-half-skill participants improved their performance by 43% from the assessment task (done without AI) to the experimental tasks (done with AI). In contrast, top-half-skill participants only improved their performance by 17%.
Whether bad or good performers, both groups got better when using AI. However, the bottom performers saw substantially larger gains, narrowing (though not eliminating) the skills gap.
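If we assume the reported percentage lifts apply directly to each group’s mean scores (my assumption for illustration; the paper reports the lifts, not these baselines), we can back out roughly how much the gap narrowed:

```python
# Reconstructing approximate baseline scores from the reported numbers.
# Assumption (mine, not the paper's): the percentage improvements apply
# directly to each group's mean assessment score.
bottom_final, bottom_lift = 5.79, 0.43
top_final, top_lift = 6.09, 0.17

bottom_base = bottom_final / (1 + bottom_lift)  # ≈ 4.05 without AI
top_base = top_final / (1 + top_lift)           # ≈ 5.21 without AI

print(f"gap without AI: {top_base - bottom_base:.2f}")  # ≈ 1.16 points
print(f"gap with AI:    {top_final - bottom_final:.2f}")  # 0.30 points
```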
AI is a forklift for the mind, taking some of the cognitive heavy lifting, thus narrowing skills gaps. (Midjourney)
Toughest Task = Worse Performance With AI
The new research presents a caveat for the unrestrained use of current AI tools. The researchers deliberately created a test task that was particularly difficult and required comparing qualitative data from interview transcripts with quantitative data in a spreadsheet.
The researchers write, “While the spreadsheet data alone was designed to seem to be comprehensive, a careful review of the interview notes revealed crucial details. When considered in totality, this information led to a contrasting conclusion to what would have been provided by AI when prompted with the exercise instructions, the given data, and the accompanying interviews.”
Sure enough, AI offered the wrong solution to this problem, which was deliberately designed to deceive it. As a result, the percentage of participants who arrived at the correct answer was as follows:
Control group (no AI): 85% correct
GPT-4, but no training: 71% correct
GPT-4 with prompt training: 60% correct
This outcome underscores the importance of human-AI symbiosis, where the two work together. In particular, humans should check the output of the AI for hallucinations or other mistakes. While the above data is discouraging, it’s still good to see that more than half of the AI users noticed that AI was wrong in this case and corrected it.
(But remember Nielsen’s First Law of AI: Today’s AI is the worst we’ll ever have. Future AI products, like GPT-5, GPT-6, etc., will be better and likely have fewer hallucinations and instances where they overlook subtle clues from qual data when analyzing quant data.)
Even for this trick task, those AI users who discovered and corrected the AI’s mistake created better deliverables than the non-AI control group. The mean quality score was 0.8 points higher for the GPT-4 (no training) group and 1.5 points higher for the GPT-4 (with prompt training) group than the score for the control group when only analyzing work products from those participants who got the answer right. Both differences were statistically significant at p<0.01.
AI Improves Elite User Performance
The bottom line from this new research: the old research was confirmed, which is always comforting. The existing findings were extended to also apply to elite users who already perform at a very high level:
AI improves productivity
AI improves quality
AI narrows skills gaps between high and low performers
A further conclusion is not as strongly supported by the current data but is still suggested: training in best practices for generative AI was helpful. AI training was not essential because participants who received no training still performed well. However, we should remember that the participants in this study were super-elite consultants with superior skills in quickly adjusting to new conditions. More average knowledge workers might benefit more from AI training than the BCG consultants.
The researchers measured retention: the percentage of AI-generated text that participants kept in their deliverables. AI users with training retained more of the AI-generated text than did AI users without training. We don’t know why, but the researchers speculate that the users who received prompt training were able to generate better output from their use of the AI. This is plausible but not proven. Support for this hypothesis comes from an additional data set in the paper: there was a positive correlation between the amount of AI text retained in the final deliverables and the rated quality of those deliverables.
Reference
Fabrizio Dell’Acqua, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine C. Kellogg, Saran Rajendran, Lisa Krayer, François Candelon, and Karim R. Lakhani (2023): “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.” Harvard Business School Technology & Operations Mgt. Unit Working Paper No. 24-013, available at: https://ssrn.com/abstract=4573321