Comment by kmeisthax

12 days ago

You want a pure-human training data set, so you have to go back in time to before 2020 to scrape training data. Either that, or only use data with a verified Wayback Machine capture from before 2020. Or invent a new training regime that doesn't require gobs of stolen text.

Actually, I have a bit of a hunch that the publishers currently suing IA over their unlicensed digital library lending program plan to bankrupt it with fees so they can repo the Wayback archive and then sell access to it to AI training start-ups.

Anyway, the reason you have to worry about all of this is that training a text or image generator on the outputs of other text and image generators reduces output diversity, and lots of people are publishing their AI slop now. There's nothing inherently wrong with AI output other than that AI content is easier to produce than human content; the problem is purely one of inflation and Sybil attacks. Think of membership in a training set as a vote for all the statistical patterns embedded in that sample. AI generates output that resembles its training data, so feeding in a bunch of AI images is like stuffing the ballot box for whatever handful of statistical patterns were already well-learned, which shifts your AI from learning and generalizing to memorizing and infringing.
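The diversity-collapse point above can be sketched with a toy simulation: fit a simple model (here, just a Gaussian) to data, publish samples from it, fit the next generation to those samples, and repeat. The sample counts and generation count below are illustrative assumptions, not anything from the comment; the shrinking spread is a crude stand-in for the loss of output diversity when generators train on each other's output.

```python
import numpy as np

# Toy "model collapse" demo: each generation fits a Gaussian to samples
# drawn from the previous generation's model, then publishes new samples.
# Statistical fluctuations get baked in and the MLE variance is biased
# low, so the distribution's spread tends to shrink over generations.
rng = np.random.default_rng(0)

n = 20            # samples per generation (assumed, deliberately small)
generations = 400  # number of train-on-your-own-output cycles (assumed)

mu, sigma = 0.0, 1.0  # generation 0: the "human" data distribution

for _ in range(generations):
    published = rng.normal(mu, sigma, size=n)  # this generation's output
    mu = published.mean()                      # next model fits only that output
    sigma = published.std()                    # MLE std: biased low, spread decays

print(f"std after {generations} generations: {sigma:.4f}")
```

A real generator isn't a one-parameter Gaussian, of course, but the mechanism is the same: each refit can only "vote for" patterns present in the previous generation's output, so rare modes drop out and the well-learned ones dominate.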