
Technology

AI language models are running out of human-written text to learn from

  • A new study released by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by sometime between 2026 and 2032.
  • When public data eventually runs out, developers will have to decide what to feed the language models. Ideas include data now considered private, like emails or text messages, and using “synthetic data” created by other AI models.
  • Besides training ever-larger models, another path is building more skilled models that are specialized for specific tasks.

Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter — the tens of trillions of words people have written and shared online.

A new study released Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade — sometime between 2026 and 2032.

Comparing it to a “literal gold rush” that depletes finite natural resources, Tamay Besiroglu, an author of the study, said the AI field might face challenges in maintaining its current pace of progress once it drains the reserves of human-generated writing.


In the short term, tech companies like ChatGPT-maker OpenAI and Google are racing to secure and sometimes pay for high-quality data sources to train their AI large language models – for instance, by signing deals to tap into the steady flow of sentences coming out of Reddit forums and news media outlets.

In the longer term, there won’t be enough new blogs, news articles and social media commentary to sustain the current trajectory of AI development, putting pressure on companies to tap into sensitive data now considered private — such as emails or text messages — or to rely on less reliable “synthetic data” spit out by the chatbots themselves.


“There is a serious bottleneck here,” Besiroglu said. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore. And scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”

Artificial intelligence systems like ChatGPT are consuming ever-larger collections of human writings that they need to get smarter. (AP Digital Embed)

The researchers first made their projections two years ago — shortly before ChatGPT’s debut — in a working paper that forecast a more imminent 2026 cutoff of high-quality text data. Much has changed since then, including new techniques that enabled AI researchers to make better use of the data they already have and sometimes “overtrain” on the same sources multiple times.

But there are limits, and after further research, Epoch now foresees running out of public text data sometime in the next two to eight years.

The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism — a philanthropic movement that has poured money into mitigating AI’s worst-case risks.


Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients — computing power and vast stores of internet data — could significantly improve the performance of AI systems.

The amount of text data fed into AI language models has been growing about 2.5 times per year, while computing has grown about 4 times per year, according to the Epoch study. Facebook parent company Meta Platforms recently claimed the largest version of its upcoming Llama 3 model — which has not yet been released — has been trained on up to 15 trillion tokens, each of which can represent a piece of a word.
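A token, as described above, is often a fragment of a word rather than a whole word. As a rough illustration — the tiny vocabulary and greedy splitting rule below are invented for this sketch, not Meta’s or anyone’s actual tokenizer — a subword tokenizer breaks words into the longest pieces it knows:

```python
# Toy greedy longest-match subword tokenizer.
# The vocabulary here is invented for illustration; real tokenizers
# (e.g. BPE) learn theirs from large corpora.
VOCAB = {"train", "ing", "token", "s", "model", "lan", "guage"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest vocabulary pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("training"))  # ['train', 'ing']
print(tokenize("tokens"))    # ['token', 's']
```

This is why 15 trillion tokens corresponds to somewhat fewer than 15 trillion words: common words map to one token, while rarer words split into several.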

But how much it’s worth worrying about the data bottleneck is debatable.

“I think it’s important to keep in mind that we don’t necessarily need to train larger and larger models,” said Nicolas Papernot, an assistant professor of computer engineering at the University of Toronto and researcher at the nonprofit Vector Institute for Artificial Intelligence.

Papernot, who was not involved in the Epoch study, said building more skilled AI systems can also come from training models that are more specialized for specific tasks. But he has concerns about training generative AI systems on the same outputs they’re producing, leading to degraded performance known as “model collapse.”



Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said. Not only that, but Papernot’s research has also found it can further encode the mistakes, bias and unfairness that are already baked into the information ecosystem.
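The photocopy effect can be sketched numerically. In this toy simulation — an illustration of the general idea of model collapse, not Papernot’s actual experiments — each “generation” of a model is fit only to samples drawn from the previous generation, and because models favor their most typical outputs, the distribution’s spread steadily collapses:

```python
import random
import statistics

random.seed(0)

def next_generation(mu, sigma, n=400):
    """Sample from the current 'model', keep the most typical half
    (generators favor high-probability outputs), then refit."""
    samples = sorted(random.gauss(mu, sigma) for _ in range(n))
    kept = samples[n // 4 : 3 * n // 4]  # central half of the samples
    return statistics.mean(kept), statistics.stdev(kept)

mu, sigma = 0.0, 1.0  # generation 0: the "human-written" data
spreads = [sigma]
for generation in range(20):
    mu, sigma = next_generation(mu, sigma)
    spreads.append(sigma)

print(f"spread after  0 generations: {spreads[0]:.4f}")
print(f"spread after 20 generations: {spreads[-1]:.4g}")  # far smaller
```

After twenty generations almost all of the original diversity is gone — the numerical analogue of a photocopy of a photocopy losing detail.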

If real human-crafted sentences remain a critical AI data source, those who are stewards of the most sought-after troves — websites like Reddit and Wikipedia, as well as news and book publishers — have been forced to think hard about how they’re being used.

“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t laugh about it, but I do find it kind of amazing.”

While some have sought to close off their data from AI training — often after it’s already been taken without compensation — Wikipedia has placed few restrictions on how AI companies use its volunteer-written entries. Still, Deckelmann said she hopes there continue to be incentives for people to keep contributing, especially as a flood of cheap and automatically generated “garbage content” starts polluting the internet.


AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” she said.

From the perspective of AI developers, Epoch’s study says paying millions of humans to generate the text that AI models will need “is unlikely to be an economical way” to drive better technical performance.

As OpenAI begins work on training the next generation of its GPT large language models, CEO Sam Altman told the audience at a United Nations event last month that the company has already experimented with “generating lots of synthetic data” for training.

“I think what you need is high-quality data. There is low-quality synthetic data. There’s low-quality human data,” Altman said. But he also expressed reservations about relying too heavily on synthetic data over other technical methods to improve AI models.


“There’d be something very strange if the best way to train a model was to just generate, like, a quadrillion tokens of synthetic data and feed that back in,” Altman said. “Somehow that seems inefficient.”


Use this map to find the data centers in your backyard


When Oregon resident Isabelle Reksopuro heard Google was gobbling up public land to fuel its data centers in her home state, she didn’t initially know what to believe. “There’s a lot of misinformation about data centers,” she said. “Google has denied taking that land.”

Technically, she explains, The Dalles, a city near the Washington state border, sought to reclaim that land, “and Google is just a big, unnamed power user.” The city had in fact asked for ownership of a 150-acre portion of Mount Hood National Forest, claiming it needs access to Mount Hood’s watershed to meet municipal needs as its population — 16,010 as of the 2020 census — grows. But critics, including environmentalists, say the city is trying to secure more water for Google, which has a sprawling data center campus in The Dalles that already consumes about one-third of the city’s water supply.

This controversy made Reksopuro curious about the backlash to data centers being built in other communities. So Reksopuro, a student at the University of Washington who studies the connections between tech and public policy, decided to map it out. Using information collected by Epoch AI and data scraped from legislation on data centers, she built an interactive map tracking AI policy around the world. She designed it to be simple enough for anyone to use. “I wanted it to be something that my younger sisters could play through and explore to understand what are the data centers in the area and what’s actually being done about it,” Reksopuro said. She hoped to shift their opinions that way, “instead of like, through TikTok.”

Four times a day, the map searches for new sources and checks them against the existing database Reksopuro built out. “Once it does that, it will write a new summary, add it to the news feed, and populate it on the sidebar,” she said. “I wanted it to be self-updating, since I’m also a student.”
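Reksopuro’s self-updating loop — fetch, check against the database, summarize new items, push them to the feed — can be sketched in outline. The class, field names, and summary format below are hypothetical (her actual implementation isn’t described in detail); the sketch only captures the dedupe-then-summarize cycle she describes:

```python
from dataclasses import dataclass, field

@dataclass
class NewsDatabase:
    """Hypothetical store of already-seen data center sources, keyed by URL."""
    seen: dict = field(default_factory=dict)

    def ingest(self, sources):
        """Check fetched sources against the database; summarize only new ones."""
        feed_updates = []
        for src in sources:
            if src["url"] in self.seen:
                continue  # already in the database, skip it
            # Stand-in for the AI-written summary in the real pipeline.
            summary = f"{src['location']}: {src['title']}"
            self.seen[src["url"]] = summary
            feed_updates.append(summary)
        return feed_updates  # entries to add to the news feed / sidebar

db = NewsDatabase()
batch = [
    {"url": "example.org/a", "location": "The Dalles, OR", "title": "Water deal questioned"},
    {"url": "example.org/b", "location": "Loudoun County, VA", "title": "New campus approved"},
]
print(db.ingest(batch))  # two new feed entries
print(db.ingest(batch))  # [] — nothing new on a repeat pass
```

Running the same batch twice shows why the dedupe check matters for a job that runs four times a day: repeat fetches of unchanged sources add nothing to the feed.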

Reksopuro isn’t against data centers, but she thinks tech giants benefit from a lack of transparency around data center policies. “Right now, it’s this really opaque thing — and all of a sudden, there’s a facility,” she said. “I think that if people knew about data centers beforehand, it would give them leverage. They would be able to negotiate: ask for job training programs, tax revenue, environmental monitoring, things to improve their community.”



Fox News AI Newsletter: Graduation speaker praises AI, gets instantly booed



Welcome to Fox News’ Artificial Intelligence newsletter with the latest AI technology advancements.

IN TODAY’S NEWSLETTER:

– UCF graduates clobber commencement speaker with boos after she says AI is the ‘next Industrial Revolution’

– OPINION: DIRECTOR KASH PATEL: We brought the FBI out of the past and into the AI age


– OpenAI backs creation of global AI governance body led by the U.S. that would include China as a member

TOUGH CROWD: During a recent commencement ceremony at the University of Central Florida, a speaker was met with loud boos from the graduating class after declaring that artificial intelligence represents the next industrial revolution. Fox News Digital reporting captures this tense cultural moment, illustrating the mixed public sentiment and skepticism surrounding AI’s growing footprint in daily life.

A statue on the campus of the University of Central Florida in Orlando, Florida. (iStock)

BADGE MEETS BYTE: Reflecting on the modernization of national security in a Fox News op-ed, FBI Director Kash Patel explores how the bureau must adapt its strategies to address modern threats and advance into the artificial intelligence age.

TECH DIPLOMACY: OpenAI is throwing its support behind the establishment of a new global artificial intelligence governance organization that would be led by the United States while notably including China as a member. Fox News Digital reporting examines the geopolitical dynamics and regulatory implications of this proposed framework as global powers race to set the standards for AI development.


EQUITY ELEVATION: The massive wave of wealth generated by the explosive growth of ChatGPT and the broader AI industry is driving a sudden surge in the San Francisco Bay Area’s luxury real estate market. Fox News Digital reporting breaks down how the influx of new tech capital is reshaping local housing dynamics and fueling a high-end property frenzy.

FBI Director Kash Patel listened as Acting Attorney General Todd Blanche spoke during a press conference at the Department of Justice on April 28, 2026, in Washington, D.C. (Tasos Katopodis/Getty Images)

STRATEGY RESET: Tech giant Cisco is planning to eliminate thousands of jobs as the company shifts its primary focus to accelerate its artificial intelligence initiatives, a move that comes despite the company beating earnings expectations. Fox News Digital reporting details the corporate restructuring and broader economic trends pushing legacy tech firms to aggressively pivot toward AI.

ROAD HAZARD: Waymo is issuing a sweeping recall of its autonomous vehicle fleet following a concerning incident that highlighted significant safety issues with the self-driving technology. Fox News Digital reporting outlines the specifics of the recall, the nature of the safety flaw, and what this setback means for the future of fully autonomous transportation on public roads.

BOTS IN THE BAY: A newly developed, artificial intelligence-powered robot has been engineered to seamlessly change and balance vehicle tires without human intervention. Fox News Digital reporting showcases this latest innovation, exploring how automation and AI mechanics could soon revolutionize the automotive service and repair industry.


OpenAI CEO Sam Altman speaks during the 2026 Infrastructure Summit in Washington, D.C., on March 11, 2026. (Kylie Cooper/Reuters)





Microsoft’s Edge Copilot update uses AI to pull information from across your tabs


Microsoft Edge is adding a new feature that will allow its Copilot AI chatbot to gather information from all of your open tabs. When you start a conversation with Copilot, you can ask the chatbot questions about what’s in your tabs, compare the products you’re looking at, summarize your open articles, and more.

In its announcement, Microsoft says you can “select which experiences you want or leave off the ones you don’t.” The company is retiring Copilot Mode as well, which could similarly draw information from your tabs but offered some agentic features, like the ability to book a reservation on your behalf. Microsoft has since folded these agentic capabilities into its “Browse with Copilot” tool.

Several other AI features are coming to Edge, including an AI-powered “Study and Learn” mode that can turn the article you’re looking at into a study session or interactive quiz. There’s a new tool that turns your tabs into AI-powered podcasts as well, similar to what you’d find on NotebookLM, and an AI writing assistant that will pop up when you start entering text on a webpage.

You can also give Copilot permission to access your browsing history to provide more “relevant, high-quality answers,” according to Microsoft. Copilot in Edge on desktop and mobile will come with “long-term memory” as well, which can tailor its responses based on your previous conversations. And, when you open up a new tab, you’ll see a redesigned page that combines chat, search, and web navigation, along with the Journeys feature, which uses AI to organize your browsing history into categories that you can revisit.

Meanwhile, an update to Edge’s mobile app will allow you to share your screen with Copilot and talk through questions about what you’re seeing. Microsoft says you’ll see “clear visual cues” when Copilot is active, “so you know when it’s taking an action, helping, listening, or viewing.”

