The problem is that the stock of data typically used to train language models could be exhausted in the near future, perhaps as early as 2026, according to a paper by researchers at Epoch, an AI research and forecasting organization. The paper has not yet been peer-reviewed. As researchers build more powerful models with greater capabilities, they have to find ever more text to train them on. Leading language model researchers are increasingly concerned that they will run out of this kind of data, says Teven Le Scao, a researcher at artificial intelligence firm Hugging Face, who was not involved in Epoch’s work.
Part of the problem stems from the fact that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two can be blurred, says Pablo Villalobos, a researcher at Epoch and lead author of the paper, but text in the former category is considered better written and is often produced by professional writers.
Data in the low-quality category consists of text such as social media posts or comments on websites like 4chan, and it far exceeds the amount of data considered high quality. Researchers typically train models only on data that falls into the high-quality category, because that is the kind of language they want the models to reproduce. This approach has yielded impressive results for large language models such as GPT-3.
According to Swabha Swayamdipta, a professor of machine learning at the University of Southern California who specializes in dataset quality, one way to overcome these data constraints would be to reassess what is defined as “low” and “high” quality. If data shortages push AI researchers to incorporate more diverse datasets into the training process, that would be a “net positive” for language models, Swayamdipta says.
Researchers can also find ways to extend the life of the data used to train language models. Currently, large language models are trained on their data only once, owing to performance and cost constraints. But it may be possible to train a model multiple times on the same data, Swayamdipta says.
Some researchers believe that, when it comes to language models, bigger does not necessarily mean better anyway. Percy Liang, a professor of computer science at Stanford University, says there is evidence that making models more efficient can improve their capabilities, rather than simply increasing their size.
“We’ve seen how smaller models trained on higher-quality data can outperform larger models trained on lower-quality data,” he explains.