sans_souse 2 days ago

I think a point often missed is that it's not just the substance and quality of those sources and their decline, but the decline in the overall number of sources, period. The first phases involved training the models on a massive backlog of raw knowledge, accumulated over thousands of years, and for the majority of that span the world was very different from ours today; in short, all of our knowledge was "boots on the ground" knowledge, all of it served to aid our growth, and our record tells exactly that story.

But our knowledge and growth today is so narrow in scope (in a sense), and there's an ever-looming scenario ready to present itself where our perceived growth is actually a recursion, and the answer to "what is the purpose?" becomes "there is none."

ipython 2 days ago

So I’ve heard of this model collapse theory. But I’ve also heard of model providers who are intentionally training with synthetically generated data (as a result of insufficient “real” data).

So I'm curious where the line is. Are there phases in the training / continued pre-training / alignment / RLHF pipeline where synthetic data isn't just harmless but actually beneficial? Is it a question of quantity, or of how much novelty is in the training data?

dinfinity a day ago

Let's remember that the basic defense against model collapse is just not training on AI-generated and other crap data.

Sure, there are places where determining whether something is AI-generated or 'real' is hard, but there are plenty of places where trust in the provider is enough basis to include the data during curation. For example, it's not as if the NYT will suddenly start pumping out unchecked AI slop.
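As a toy illustration of what that curation step could look like (the allowlist, field names, and documents here are all made up, just to show the shape of the filter):

    # Hypothetical trust-based curation: keep only documents whose
    # provenance is on an allowlist of vetted sources.
    TRUSTED_SOURCES = {"nytimes.com", "arxiv.org", "gutenberg.org"}

    def curate(documents):
        """Drop anything whose source we don't explicitly trust."""
        return [d for d in documents if d.get("source") in TRUSTED_SOURCES]

    corpus = [
        {"source": "nytimes.com", "text": "An edited, fact-checked article..."},
        {"source": "content-farm.example", "text": "Unverified slop..."},
    ]
    print(curate(corpus))  # only the nytimes.com document survives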

And then there is the enormous potential of data that is synthesized with the aid of AI, but not completely generated by it, and validated for accuracy through systematic means.
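For some domains that validation really can be mechanical. A toy sketch of "synthesize, then verify", with a random function standing in for the model:

    import random

    def synthesize_problem():
        """Stand-in for a generator: returns a question plus a claimed answer."""
        a, b = random.randint(1, 99), random.randint(1, 99)
        claimed = a + b + random.choice([0, 0, 0, 1])  # occasionally wrong, like a model
        return a, b, f"What is {a} + {b}?", claimed

    def validated_batch(n):
        kept = []
        for _ in range(n):
            a, b, question, claimed = synthesize_problem()
            if claimed == a + b:  # recompute the answer independently of the generator
                kept.append({"question": question, "answer": claimed})
        return kept

    print(len(validated_batch(1000)))  # roughly 750 examples survive

The key property is that the check doesn't trust the generator: it recomputes the answer on its own, so incorrect synthetic examples never reach the training set.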