Experts warn that AI training data is running out
As the use of artificial intelligence (AI) grows, experts have warned that the field may soon exhaust its supply of training data, the resource that makes AI systems work.
This could slow the development of AI models, particularly large language models, and may even alter the trajectory of the AI revolution.
But why might a lack of data be a problem, given how much of it exists on the web? And is there any way to address the risk?
Why high-quality data matters for AI
Training powerful, accurate, high-performing AI algorithms requires a great deal of data. ChatGPT, for example, was trained on 570 gigabytes of text data, roughly 300 billion words.
Similarly, the Stable Diffusion model, which underpins many AI image-generating apps such as DALL-E, Lensa and Midjourney, was trained on the LAION-5B dataset of 5.8 billion image-text pairs. If an algorithm is trained on too little data, it will produce inaccurate or low-quality output.
The quality of the training data also matters. Low-quality data such as blurry photos or social media posts is easy to find, but it is not sufficient to train high-performing AI models.
Text taken from social media sites can be biased or prejudiced, or may contain disinformation or illegal content the model then reproduces. For example, when Microsoft trained its AI bot on Twitter content, it learned to produce racist and sexist output.
This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained on 11,000 romance novels taken from the self-publishing site Smashwords to make it more conversational.
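To give a concrete sense of what "filtered web content" means in practice, here is a minimal sketch of a heuristic text-quality filter. The specific rules and thresholds are invented for illustration; real screening pipelines are far more elaborate.

```python
# Toy text-quality filter: keep documents that look like real prose.
# The heuristics and thresholds below are illustrative assumptions,
# not any AI lab's actual filtering pipeline.

def looks_high_quality(text: str) -> bool:
    """Return True if the text passes a few crude quality checks."""
    words = text.split()
    if len(words) < 20:                # too short to be useful prose
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    if alpha_ratio < 0.8:              # too many symbols/numbers (spammy)
        return False
    avg_len = sum(len(w) for w in words) / len(words)
    if not 3 <= avg_len <= 10:         # implausible average word length
        return False
    return True

docs = [
    "Buy now!!! $$$ click http://spam",
    "The dataset was curated from books, articles and encyclopedic text, "
    "then filtered for length, language and duplication before training began.",
]
kept = [d for d in docs if looks_high_quality(d)]
```

Running the filter over the two sample documents keeps only the prose-like one and discards the spam-like one.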
Is there enough data?
The AI industry has been training systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT and DALL-E 3. But research shows the stock of data available online is growing much more slowly than the amount used to train AI.
In a paper published last year, a group of researchers predicted that if current trends continue, we will run out of high-quality text data before 2026. They also estimated that low-quality language data will be exhausted between 2030 and 2050, and low-quality image data between 2030 and 2060.
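Projections like these are essentially extrapolation exercises: compare the growth rate of training-data demand against the growth rate of the available stock, and find the year where the curves cross. A toy version, with numbers invented purely for illustration (they are not the paper's estimates):

```python
# Toy extrapolation of when training-data demand overtakes the available
# stock of text. ALL numbers here are hypothetical, chosen only to show
# the mechanics of the calculation.

stock = 100.0          # hypothetical units of high-quality text available
demand = 1.0           # hypothetical units consumed by training this year
stock_growth = 1.07    # assumed ~7% yearly growth of online text
demand_growth = 1.50   # assumed ~50% yearly growth of training-set size

year = 2024
while demand < stock:
    stock *= stock_growth
    demand *= demand_growth
    year += 1

print(year)  # first year in which demand exceeds the hypothetical stock
```

Because demand compounds much faster than supply in this toy setup, the crossover arrives within a couple of decades even though demand starts at just 1% of the stock, which is the intuition behind the researchers' warning.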
The accounting and consulting firm PwC estimates AI could contribute up to US$15.7 trillion (A$24.1 trillion) to the world economy by 2030. But running out of usable data could slow its development.
Should we be worried?
These figures might alarm some AI enthusiasts, but the situation may not be as bad as it seems. There is much we don't know about how AI models will develop in the future, and there are several ways to address the risk of data shortages.
One option is for AI developers to improve their algorithms so they use the data they already have more efficiently.
In the coming years, they will likely be able to train high-performing AI systems with less data, and possibly less computing power. This would also help reduce AI's carbon footprint.
Another option is to use AI itself to create synthetic data for training systems. In other words, developers can generate the data they need, tailored to their particular AI model.
Several projects already use synthetic content, often sourced from data-generation services such as Mostly AI. This will become more common in the future.
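The core idea behind synthetic data can be sketched very simply: fit a statistical model to real records, then sample new, artificial records that share the same statistics. The toy Gaussian example below is only an illustration of that principle, not what commercial services such as Mostly AI actually run.

```python
import random
import statistics

# Toy synthetic-data generator: learn the mean and spread of a real
# numeric column, then sample artificial values with similar statistics.
# Real services use far richer models; this only illustrates the idea.

real_ages = [23, 31, 35, 29, 41, 38, 27, 33]  # pretend customer data

mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

random.seed(0)  # reproducible sampling
synthetic_ages = [round(random.gauss(mu, sigma)) for _ in range(1000)]

# The synthetic column mimics the real one without copying any record,
# so a model can be trained on it without exposing the originals.
print(round(statistics.mean(synthetic_ages), 1))
```

With enough samples, the synthetic column's mean and spread track the real column's closely, which is what lets a model trained on it behave much like one trained on the originals.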
Developers are also searching beyond freely available online content, looking at material held by large publishers and in offline archives. Think of the millions of texts published before the internet. Made available digitally, they could provide a new data source for AI projects.
News Corp, which owns a large amount of news content (much of it behind a paywall), recently said it was negotiating content deals with AI developers. Such deals would force AI companies to pay for training data, whereas until now they have mostly scraped it from the internet for free.
Content creators have protested against the unauthorised use of their work to train AI models, and some have sued companies such as Microsoft, OpenAI and Stability AI. Paying creators for their work may help redress the power imbalance between them and AI companies.