Why Adding a Human Doesn’t Automatically Make AI Better

The assumption that humans plus AI will always outperform either alone has become a cornerstone of how organizations are deploying AI today. But what if that assumption is wrong — or at least, far more complicated than we think?

In Episode 4 of AI Horizons, Prasanna “Sonny” Tambe, faculty co-director of Wharton Human AI Research, hosted Gérard Cachon, Fred R. Sullivan Professor of Operations, Information, and Decisions at the Wharton School, and Alex Moehring, assistant professor at Purdue University’s Daniels School of Business, to examine the real-world friction points in human-AI collaboration. Drawing on empirical research and economic theory, the conversation challenged some of the most widely held beliefs about how humans and AI work together, and offered a more grounded way forward. The following are key takeaways from that discussion.

Humans with AI don’t automatically outperform humans without it, even when the AI is excellent.

In a large-scale study involving hundreds of professional radiologists, Moehring and collaborators found that radiologists given access to an AI diagnostic tool performed, on average, about the same as those who didn’t have access to it, despite the AI outperforming roughly three-quarters of human radiologists on its own. The finding was stark: a highly capable tool produced no measurable improvement when paired with human reviewers. For organizations rolling out AI tools and assuming immediate productivity gains, this is a critical reality check. Deployment alone is not enough.

The problem isn’t that humans ignore AI — it’s how they use it.

Radiologists in Moehring’s study weren’t tuning out the AI; their assessments did shift in response to its predictions. The issue was more nuanced: humans tended to underweight the AI’s signal relative to their own judgment, and crucially, failed to account for the fact that they and the AI were often looking at the same underlying information. This “double-counting” led to worse performance when the AI was uncertain. As Moehring put it, “humans are not optimally using AI tools,” and that gap between current usage and optimal usage represents a significant, largely untapped opportunity for organizations to improve outcomes.
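To make the double-counting idea concrete, here is a minimal sketch under a toy Gaussian model. The model is an illustrative assumption, not the one used in Moehring's study: the truth has variance 1, and the human and the AI each see the truth plus a shared noise term (the common underlying information) plus their own independent noise. All function and parameter names are hypothetical. The sketch compares the error of a reader who combines the two signals as if they were independent with the error of the best linear combination that accounts for the overlap.

```python
# Toy model (illustrative assumption, not the study's model):
#   truth ~ N(0, 1)
#   human signal = truth + shared noise + human-only noise
#   AI signal    = truth + shared noise + AI-only noise

def mse(w_h, w_a, var_shared, var_h, var_a):
    """Mean squared error of the estimate w_h * human + w_a * ai."""
    total = w_h + w_a
    return ((total - 1) ** 2            # leftover error on the truth itself
            + total ** 2 * var_shared   # shared noise enters through both signals
            + w_h ** 2 * var_h
            + w_a ** 2 * var_a)

def naive_weights(var_shared, var_h, var_a):
    """Weights that treat the two signals as if their noise were independent."""
    prec_h = 1.0 / (var_shared + var_h)
    prec_a = 1.0 / (var_shared + var_a)
    denom = 1.0 + prec_h + prec_a       # 1.0 is the precision of the prior
    return prec_h / denom, prec_a / denom

def optimal_weights(var_shared, var_h, var_a):
    """Best linear weights once the shared component is accounted for."""
    cov = 1.0 + var_shared              # covariance between the two signals
    s_hh = cov + var_h                  # variance of the human signal
    s_aa = cov + var_a                  # variance of the AI signal
    det = s_hh * s_aa - cov * cov
    return (s_aa - cov) / det, (s_hh - cov) / det

if __name__ == "__main__":
    params = dict(var_shared=1.0, var_h=0.5, var_a=0.25)
    print("naive  :", mse(*naive_weights(**params), **params))
    print("optimal:", mse(*optimal_weights(**params), **params))
```

In this toy setup the naive reader places too much total weight on the two signals because the shared noise is counted twice. When `var_shared` is zero, the two sets of weights coincide and the gap disappears.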

As AI gets better, the incentive problem gets worse.

Cachon’s research introduces an economic lens that most organizations aren’t applying: when AI handles tasks correctly most of the time, it becomes increasingly costly to motivate employees to rigorously inspect its output. Workers rationally calculate that careful review is unlikely to catch an error, so effort declines. “The price you have to pay as a firm owner to motivate that inspection can get quite high, especially as AI gets even better,” Cachon noted. The implication is counterintuitive but important: improving AI quality doesn’t resolve the oversight problem; it can actually intensify it. Organizations need to think about incentive design, not just tool quality.
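A back-of-the-envelope sketch shows why the price of inspection can rise as AI improves. It uses a deliberately simple textbook-style setup that is an assumption here, not Cachon's actual model: the firm cannot observe effort, so it pays a bonus only when the final output is correct; inspection costs the worker a fixed amount and always catches an AI error; without inspection, the output is correct only when the AI was right. The worker inspects only if the bonus minus the cost is at least the expected bonus from skipping the review, which works out to a bonus of at least the inspection cost divided by the AI error rate. The function name and parameters are hypothetical.

```python
def min_bonus(inspection_cost, ai_error_rate):
    """Smallest success bonus that makes careful inspection worthwhile.

    Toy assumptions (not from the source research):
      - inspecting costs the worker `inspection_cost` and always fixes errors,
        so the worker earns bonus - inspection_cost;
      - skipping inspection earns the bonus only when the AI was right,
        so the worker expects (1 - ai_error_rate) * bonus.
    Inspecting is preferred when bonus >= inspection_cost / ai_error_rate.
    """
    if not 0 < ai_error_rate <= 1:
        raise ValueError("ai_error_rate must be in (0, 1]")
    return inspection_cost / ai_error_rate

if __name__ == "__main__":
    for rate in (0.20, 0.05, 0.01):
        bonus = min_bonus(1.0, rate)
        print(f"AI error rate {rate:.0%}: bonus of at least {bonus:.0f}x the inspection cost")
```

In this sketch the required bonus grows without bound as the error rate approaches zero, which is one simple way to see how a more accurate AI can make diligent review more expensive to motivate.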

The most valuable human skill in an AI-augmented workplace may be knowing when not to use AI.

Both researchers converged on a skill that doesn’t get enough attention: the ability to judge which tasks AI handles reliably and which it doesn’t. Cachon described this as a “third skill” beyond just reviewing or fixing AI output. A radiologist who knows which pathologies AI reads well, and which it doesn’t, can delegate confidently in the first case and step in meaningfully in the second. “Having humans that have the judgment of knowing when to let AI take the task, and when to step in, that’s an incredibly powerful skill,” Cachon said. This mirrors traditional management judgment, and organizations should start treating it as a core competency to develop.

Better AI tools are not the same as better human-AI teams. Design them differently.

A key insight from Moehring’s research is that training the best-performing AI model is not the same as training the best model for human-AI collaboration. In some cases, a slightly weaker AI (one trained on information independent of what the human already sees) can actually improve team performance by eliminating the double-counting bias. The dominant paradigm of optimizing AI on benchmarks and handing it to humans assumes that benchmark performance translates into real-world gains. It often doesn’t. “It’s not obvious that the best performing model is actually going to be the best for humans,” Moehring noted. Firms deploying AI should evaluate human-AI team performance, not just model performance in isolation.

Workflow design matters as much as the technology itself.

Organizations focused on which AI tool to use may be missing the more important question: how is the human review process structured around it? Cachon emphasized that anything making the inspection process faster or easier for humans will improve outcomes. That could mean redesigning review interfaces, building AI tools that help humans audit other AI output, or, as Tambe noted, periodically removing AI access to keep human skills sharp. Moehring added that in settings where AI confidence is high, automating those decisions entirely and redirecting human attention to harder, higher-judgment cases may produce better results than keeping a human nominally in the loop on everything.
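The routing idea Moehring describes can be sketched in a few lines. This is a hypothetical illustration, not a workflow from the discussion: the `Case` type, the `ai_confidence` score, and the 0.95 threshold are all assumptions, and in practice the threshold would need to be validated against measured human-AI team performance.

```python
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    ai_confidence: float  # hypothetical score in [0, 1] reported by the AI tool

def route(cases, threshold=0.95):
    """Split cases into those handled automatically and those sent to a human.

    Cases at or above the (assumed) confidence threshold are automated;
    everything else goes to human review, where attention matters most.
    """
    automated, human_review = [], []
    for case in cases:
        if case.ai_confidence >= threshold:
            automated.append(case)
        else:
            human_review.append(case)
    return automated, human_review

if __name__ == "__main__":
    queue = [Case("a", 0.99), Case("b", 0.62), Case("c", 0.95)]
    auto, manual = route(queue)
    print("automated:", [c.case_id for c in auto])
    print("human review:", [c.case_id for c in manual])
```

The design choice worth noting is that the human is removed from the high-confidence cases entirely rather than kept nominally in the loop, so review effort concentrates on the cases the AI is least sure about.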

This content was created with the assistance of generative AI. All AI-generated materials are reviewed and edited by the Wharton AI & Analytics Initiative to ensure accuracy, clarity, and alignment with our standards.

About Wharton AI & Analytics Insights

Wharton AI & Analytics Insights is a thought leadership series from the Wharton AI & Analytics Initiative. Featuring short-form videos and curated digital content, the series highlights cutting-edge faculty research and real-world business applications in artificial intelligence and analytics. Designed for corporate partners, alumni, and industry professionals, the series brings Wharton expertise to the forefront of today’s most dynamic technologies.