Generative models trained on their own output tend to deteriorate over time, gradually forgetting the true underlying data distribution. This phenomenon, known as “model collapse,” causes models to over-represent common events and forget rarer but important ones. Because most training data is scraped from the internet, the risk grows as human-generated content becomes scarce relative to synthetic content. Mitigations include incentivizing content created without AI tools and developing methods to detect and filter out AI-generated data before training. Techniques such as watermarking and machine learning classifiers have been explored, but both have limitations; another approach examines the curvature of a generative model’s log probability function to separate AI-generated from human-written text. The core challenge is establishing detection methods that are both reliable and scalable.
What Happens When Most Online Content Becomes AI-Generated?
Generative AI models have revolutionized content creation by producing highly realistic and complex text, images, and sound. However, there is a concern that these models deteriorate when trained on the data they themselves generate. This can lead to a phenomenon called “model collapse,” where models forget the true underlying data distribution.
Why Does Model Collapse Happen?
Researchers have found that when models are trained primarily on content they generate, they forget the tails of the real distribution and over-represent its center. The models lose the ability to produce improbable, less frequent events, so the learned distribution drifts further from the original one with each generation of training. The toy simulation below illustrates the effect.
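Here is a minimal sketch of the effect, using a toy Gaussian in place of a real generative model (an assumption made purely for illustration; it is not the setup used in the research). Each generation is fitted only to samples drawn from the previous generation’s fit, and the fitted variance tends to drift toward zero, thinning out the tails.

```python
import numpy as np

# Toy Gaussian analogue of model collapse: each "generation" refits a
# Gaussian to samples drawn from the previous generation's fit. Sampling
# error compounds, the estimated variance tends to drift toward zero,
# and rare tail events disappear from the model.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # generation 0: the true data distribution
n = 50                 # small sample per generation exaggerates the effect

for gen in range(1, 201):
    data = rng.normal(mu, sigma, n)        # "train" only on generated data
    mu, sigma = data.mean(), data.std()    # refit the next-generation model
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

The deliberately small per-generation sample size makes the compounding sampling error visible within a few hundred generations; with larger samples the same drift still occurs, just more slowly.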
The Risk of AI-Dominated Content
If the majority of online content becomes AI-generated, future generative models will increasingly be trained on synthetic data, compounding the deterioration. This raises the question of how to mitigate the risk and preserve model performance.
Possible Solutions
There are two approaches to address this issue:
- Promoting Human Content Creation: Encouraging creators to use generative models less frequently or in fewer contexts. However, verifying whether content is genuinely human-generated is a challenge.
- Detecting AI-Generated Data: Developing methods to distinguish human-generated from AI-generated data during the model training process. Techniques such as watermarking and machine learning classifiers have been explored, but they have limitations.
How to Detect AI-Generated Data
One approach is watermarking, which embeds hidden statistical signals in generated content that detection algorithms can later verify. However, watermarking only works if AI providers adopt it, and adoption is currently limited.
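As a hedged illustration, here is a toy “green list” detector in the spirit of published watermarking schemes. The word-level granularity, the hashing rule, and the GREEN_FRACTION constant are all simplifications invented for this sketch; real schemes operate on tokenizer IDs and bias the model’s logits during sampling.

```python
import hashlib

# Toy sketch: a generator would bias sampling toward "green" words; a
# detector then checks whether the green rate is suspiciously high.
GREEN_FRACTION = 0.5

def is_green(prev_word: str, word: str) -> bool:
    # Pseudo-randomly assign each word to the green list, seeded by the
    # previous word, so the partition is reproducible at detection time.
    digest = hashlib.sha256(f"{prev_word}|{word}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def green_rate(text: str) -> float:
    words = text.lower().split()
    hits = sum(is_green(a, b) for a, b in zip(words, words[1:]))
    return hits / max(len(words) - 1, 1)

# Unwatermarked text should score near GREEN_FRACTION; text generated
# with a matching green-list bias would score significantly higher.
print(green_rate("the quick brown fox jumps over the lazy dog"))
```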
Another approach is to train machine learning classifiers to label content as AI-generated or human-generated. In practice, these classifiers have high error rates, especially on text unlike their training data, and are not yet a robust solution.
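A minimal sketch of such a classifier follows, assuming scikit-learn is installed and that you already have labeled examples (obtaining reliable labels is itself the hard part in practice). The placeholder texts and the TF-IDF-plus-logistic-regression pipeline are illustrative choices, not a recommended production detector; surface features like these transfer poorly across models and domains, which is one source of the high error rates noted above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder toy data: 0 = human-written, 1 = AI-generated.
texts = ["example human-written passage ...", "example model output ..."]
labels = [0, 1]

# Bag-of-ngrams features feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

# Probability that a new passage is AI-generated, per this toy model.
print(clf.predict_proba(["some new passage to score"]))
```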
A zero-shot approach based on the curvature of a large language model’s (LLM’s) log probability function also shows promise: model-generated text tends to sit near local maxima of the log probability, so small perturbations lower its likelihood more sharply than they do for human-written text.
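A minimal sketch of that curvature test follows, assuming the torch and transformers packages and using GPT-2 as the scoring model. The hand-written rewrites are a stand-in: a faithful implementation would generate perturbations automatically with a mask-filling model rather than take them as given.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Curvature test sketch: if text sits near a local maximum of the
# model's log probability, perturbing it lowers the likelihood more
# than it would for human-written text.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # The returned loss is the mean negative log-likelihood per token.
    return -model(ids, labels=ids).loss.item()

def curvature_score(text: str, perturbations: list[str]) -> float:
    base = log_likelihood(text)
    perturbed = sum(log_likelihood(p) for p in perturbations) / len(perturbations)
    return base - perturbed  # larger positive values suggest AI-generated

sample = "The moon orbits the Earth roughly once every 27 days."
rewrites = [
    "The moon circles the Earth about once every 27 days.",
    "Roughly every 27 days, the moon completes an orbit of the Earth.",
]
print(curvature_score(sample, rewrites))
```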
Conclusion
Training generative models on human-generated data is crucial to avoiding performance decline. As more online content becomes AI-generated, it will be important both to incentivize human content creation and to develop reliable methods for detecting AI-generated content.
To stay competitive and leverage AI for your company, consider implementing AI solutions that automate customer engagement and improve sales processes. Connect with us at hello@itinai.com for AI KPI management advice and explore our AI Sales Bot at itinai.com/aisalesbot.