AI Horizons: Safety, Ethics, and Data Rights in the Age of AI

AI systems are built on data. Every Large Language Model (LLM), from chatbots to medical tools, learns from vast collections of text, images, and domain-specific material. These models depend on high-quality, often human-labeled datasets to stay accurate and adaptable. Without such datasets, models risk becoming brittle, biased, and subject to the familiar rule: “garbage in, garbage out.”

At the third annual Business and Generative AI Conference held by Wharton Human-AI Research (WHAIR) in San Francisco, Luyang Zhang, a PhD student at Carnegie Mellon University, presented his paper “Fair Share Data Pricing: Data Valuation for Large Language Models,” which directly addresses this topic. Beibei Li, Professor of IT and Management at Carnegie Mellon University and the paper’s co-author, joined us to kick off the first episode of this year’s AI Horizons webinar series.

In conversation with Lynn Wu, Associate Professor of Operations, Information and Decisions at the Wharton School, Li – whose work in this area bridges data science, AI, and human behavior, and has been recognized by organizations like Google, Adobe, and the Marketing Science Institute – explained how data, especially human-labeled data, drives AI progress. She also detailed why today’s data markets are both unfair and unsustainable. Here are the key lessons for business leaders.

Human-labeled data is still essential

Generative AI can create synthetic data, but it can’t fully replace human input. Without real examples created by people, LLMs struggle with specialized tasks like medical diagnosis or legal advice. As Li put it, “If you keep feeding a model with synthetic data, it can go into a bubble.” Businesses need to plan for the continued importance and cost of quality human-labeled datasets.

Today’s data markets underpay workers and hurt companies

Many data labelers earn about $2 an hour, even for complex tasks. This “gig work” system saves money in the short run but drives away skilled workers, lowering data quality. This, in turn, results in weaker AI models. “Even if you’re well-funded, you can’t find high-quality supplies,” Li warned.

Data should be priced by its true value

Li’s team tested ways to measure how much a dataset improves model performance. If adding a dataset boosts accuracy significantly, it should be priced higher than one that has little impact. Their simulations showed this approach raises pay for workers while giving buyers better data over time.
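
To make the idea of pricing by contribution concrete, here is a minimal sketch in Python. It is not the method from Li and Zhang's paper: it assumes a simple leave-one-out rule, in which a dataset's value is the drop in accuracy when that dataset is removed, and it splits a fixed buyer budget in proportion to those gains. The dataset names, the accuracy numbers, and the functions marginal_gains, split_budget, and toy_evaluate are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical accuracy lift each dataset adds on top of a 0.60 baseline.
# These numbers are invented for illustration, not taken from the paper.
TOY_CONTRIBUTION: Dict[str, float] = {
    "clinical_notes": 0.08,
    "forum_posts": 0.02,
    "synthetic_qa": 0.00,
}


def toy_evaluate(subset: FrozenSet[str]) -> float:
    """Stand-in for 'train on these datasets, then measure accuracy'."""
    return 0.60 + sum(TOY_CONTRIBUTION[name] for name in subset)


def marginal_gains(
    datasets: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Leave-one-out gain: accuracy with every dataset minus accuracy without one."""
    names = list(datasets)
    full = frozenset(names)
    full_score = evaluate(full)
    # Clamp at zero so a dataset that hurts accuracy is not assigned a negative price.
    return {name: max(full_score - evaluate(full - {name}), 0.0) for name in names}


def split_budget(gains: Dict[str, float], budget: float) -> Dict[str, float]:
    """Divide a fixed budget across datasets in proportion to their gains."""
    total = sum(gains.values())
    if total == 0:
        return {name: 0.0 for name in gains}
    return {name: budget * gain / total for name, gain in gains.items()}


if __name__ == "__main__":
    gains = marginal_gains(TOY_CONTRIBUTION, toy_evaluate)
    for name, price in split_budget(gains, budget=1000.0).items():
        print(f"{name}: gain={gains[name]:.3f}, price=${price:.2f}")
```

In this toy setup, the dataset that lifts accuracy most receives the largest share of the budget, and a dataset that adds nothing receives nothing. A real valuation would need actual training and evaluation runs in place of toy_evaluate, and leave-one-out is only one of several possible contribution measures.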

Transparency helps smaller players compete

Big tech companies can afford to buy up all kinds of data – even low-quality sets. Startups can’t. Transparent pricing helps smaller firms target the highest-value data and stretch their budgets further. Li called this “democratizing access,” since it prevents innovation from being concentrated in just a few large players.

Fairness and sustainability are on the line

Many data workers live in low-income countries and rarely see the value of their contributions. Fair pricing makes their work visible and combats what Li calls “data colonialism,” where richer nations take advantage of poorer regions. For policymakers, fairer pricing isn’t just about ethics; it’s about ensuring a steady supply of quality data to keep AI progress moving.

Both policymakers and companies have a role

Governments could adopt fair data pricing in public-sector projects like healthcare, while companies could test royalty-style models where labelers share in future economic value. Together, these steps could build a healthier data market.

Why it matters

AI systems are only as good as the data they’re trained on. Companies that treat data labeling as cheap gig work risk undermining their own products. By paying fairly and transparently, organizations can get better-quality data, strengthen their AI models, and support a more sustainable and ethical AI ecosystem. In Li’s words, this isn’t just a fairness issue – it’s an “innovation issue.”

About Wharton AI & Analytics Insights

Wharton AI & Analytics Insights is a thought leadership series from the Wharton AI & Analytics Initiative. Featuring short-form videos and curated digital content, the series highlights cutting-edge faculty research and real-world business applications in artificial intelligence and analytics. Designed for corporate partners, alumni, and industry professionals, the series brings Wharton expertise to the forefront of today’s most dynamic technologies.