We might run out of data to train AI language programs

November 24, 2022

The types of data typically used for training language models could be exhausted in the near future, possibly as early as 2026, according to a paper by researchers at Epoch, an AI research and forecasting organization, that has yet to be peer-reviewed. The issue is that as researchers build more powerful models with greater capabilities, they have to find ever more text to train them on. Leading language model researchers are increasingly concerned that they will run out of this kind of data, says Teven Le Scao, a researcher at the artificial intelligence firm Hugging Face, who was not involved in Epoch's work.
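For a rough sense of how such a projection works (using illustrative numbers, not Epoch's actual estimates), one can compare a slowly growing stock of high-quality text against training-set sizes that grow much faster:

```python
# Back-of-the-envelope projection of when demand for training tokens could
# overtake the available stock of high-quality text. All numbers here are
# illustrative placeholders, not figures from the Epoch paper.

stock_tokens = 5e12    # assumed current stock of high-quality tokens
stock_growth = 1.07    # assumed ~7% of new text added per year
demand_tokens = 5e11   # assumed tokens consumed by today's largest training runs
demand_growth = 2.0    # assumed doubling of training-set size each year

year = 2022
while demand_tokens < stock_tokens and year < 2040:
    stock_tokens *= stock_growth
    demand_tokens *= demand_growth
    year += 1

print(f"Under these assumptions, demand overtakes supply around {year}")
```

Under these made-up growth rates the crossover lands in the mid-2020s, which is the shape of the argument, if not the exact numbers, behind the 2026 estimate.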

Part of the problem stems from the fact that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two can be blurry, says Pablo Villalobos, a researcher at Epoch and lead author of the paper, but text in the former is considered better written and is often produced by professional writers.

Data in the low-quality category consists of text such as social media posts or comments on websites like 4chan, and it far exceeds the data considered high quality. Researchers typically train models only on data that falls into the high-quality category, because that is the kind of language they want the models to reproduce. This approach has produced impressive results for large language models such as GPT-3.
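As a loose illustration of what this kind of filtering can look like (a hypothetical rule-based filter; production pipelines often rely on trained classifiers instead):

```python
import re

def looks_high_quality(text: str) -> bool:
    """Hypothetical heuristic quality filter: keep documents that are
    reasonably long, mostly alphabetic, and split into several sentences.
    Real training pipelines typically use learned classifiers rather than
    hand-written rules like these."""
    if len(text.split()) < 50:                        # too short to be an article
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.7:                             # heavy markup, emoji, or spam
        return False
    return len(re.split(r"[.!?]+", text)) >= 3        # has some sentence structure

corpus = ["lol first!!!", "A long, carefully edited article. " * 20]
high_quality = [doc for doc in corpus if looks_high_quality(doc)]
print(len(high_quality), "of", len(corpus), "documents kept")
```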

According to Swabha Swayamdipta, a professor of machine learning at the University of Southern California who specializes in dataset quality, one way to overcome these data constraints would be to reassess what is defined as "low" and "high" quality. If data shortages push AI researchers to incorporate more diverse datasets into the training process, that would be a "net positive" for language models, Swayamdipta says.

Researchers may also find ways to extend the life of the data used to train language models. Currently, large language models are trained on the same data only once, owing to performance and cost constraints. But it may be possible to train a model several times on the same data, Swayamdipta says.
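A minimal sketch of that difference, with a toy PyTorch model standing in for a real language model (the model, data, and epoch counts below are placeholders, not any lab's actual setup): the loop structure is the same whether the data is seen once or several times.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for a language model and its corpus; the point is only the
# loop structure, i.e. how many times the same data is passed over.
model = nn.Linear(16, 16)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataset = TensorDataset(torch.randn(128, 16), torch.randn(128, 16))
loader = DataLoader(dataset, batch_size=32)

def train(num_epochs: int) -> None:
    for _ in range(num_epochs):          # num_epochs=1 ~ today's single-pass practice
        for inputs, targets in loader:   # num_epochs>1 reuses the same data
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()

train(num_epochs=1)   # current practice: one pass over the data
train(num_epochs=4)   # hypothetical reuse of the same data several times
```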

Some researchers believe that bigger does not necessarily mean better when it comes to language models anyway. Percy Liang, a professor of computer science at Stanford University, says there is evidence that making models more efficient, rather than simply larger, can improve their capabilities. "We've seen how smaller models trained on higher-quality data can outperform larger models trained on lower-quality data," he explains.
