Comment by MrThoughtful

12 days ago

How hard would it be to create a "raw" model on a corpus like Hacker News or Wikipedia?

With "raw", I mean that it is simply trained to predict the next token and nothing else.

Would be fun to play with such a model.
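
Concretely, the only objective I have in mind is next-token cross-entropy. A toy sketch in PyTorch, just to pin down what "nothing else" means (the corpus file, model shape, and hyperparameters are all placeholders, not a recipe):

    # Minimal "raw" language model: nothing but next-token prediction.
    # Toy character-level setup; corpus.txt is a hypothetical text dump.
    import torch
    import torch.nn as nn

    text = open("corpus.txt").read()
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    data = torch.tensor([stoi[ch] for ch in text])

    class RawLM(nn.Module):
        def __init__(self, vocab_size, dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, vocab_size)

        def forward(self, x):
            h, _ = self.rnn(self.embed(x))
            return self.head(h)

    model = RawLM(len(vocab))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    ctx = 128

    for step in range(10_000):
        starts = torch.randint(0, len(data) - ctx - 1, (32,)).tolist()
        x = torch.stack([data[j:j + ctx] for j in starts])          # inputs
        y = torch.stack([data[j + 1:j + ctx + 1] for j in starts])  # shifted by one
        logits = model(x)
        # The entire training signal: cross-entropy on the next token.
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, len(vocab)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()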

You want a pure-human training set, so you have to go back in time to before 2020 to scrape your data. Either that, or only use data with a verified Wayback Machine capture from before 2020. Or invent a new training regime that doesn't require gobs of stolen text.
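
If you went the Wayback route, the filter is mechanical enough to sketch. Something like this, using the public CDX API (the endpoint and parameters are my best understanding; check the Internet Archive's developer docs before relying on it):

    # Keep only URLs with at least one Wayback Machine capture before 2020.
    # Endpoint and parameters per my reading of the CDX API docs; verify
    # against https://archive.org/developers/ before trusting this.
    import requests

    def has_pre_2020_capture(url: str) -> bool:
        resp = requests.get(
            "https://web.archive.org/cdx/search/cdx",
            params={"url": url, "to": "20191231", "output": "json", "limit": 1},
            timeout=30,
        )
        resp.raise_for_status()
        rows = resp.json() if resp.text.strip() else []
        return len(rows) > 1  # first row is a header; any extra row is a capture

    print(has_pre_2020_capture("news.ycombinator.com"))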

Actually, I have a bit of a hunch that the publishers currently suing the Internet Archive over its unlicensed digital lending program plan to bankrupt it with fees so they can repossess the Wayback Machine archive and then sell access to it to AI training start-ups.

Anyway, the reason you have to worry about all of that is that training a text or image generator on the outputs of other generators reduces output diversity, and lots of people are publishing their AI slop now. There's nothing inherently wrong with AI output beyond the fact that it's much easier to produce than human work; the problem is purely one of inflation and Sybil attacks.

Think of membership in a training set as a vote for all the statistical patterns embedded in that sample. A generator produces output that resembles its training data, so feeding in a pile of AI images is like stuffing the ballot box with whichever handful of statistical patterns were already well learned, which shifts your model from learning and generalizing toward memorizing and infringing.
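
You can watch the diversity loss happen in a toy version of that loop: sample a corpus from the current model, fit the next model to it, repeat. With finite sampling, rare patterns hit zero probability and never come back, like genetic drift. A cartoon of the dynamics, not a claim about any particular model:

    # Toy feedback loop: each "generation" trains on a corpus sampled from
    # the previous generation's model. Rare patterns drop to zero and are
    # gone for good, so diversity only ratchets down.
    import numpy as np

    rng = np.random.default_rng(0)
    K, N = 1000, 2000               # 1000 distinct "patterns", 2000-doc corpus
    p = np.full(K, 1.0 / K)         # generation 0: human data, uniform diversity

    for gen in range(20):
        counts = rng.multinomial(N, p)   # publish a corpus from the model
        p = counts / N                   # train the next model on that corpus
        print(f"gen {gen:2d}: surviving patterns = {(p > 0).sum()}")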

That's what the "base" models are: pure token prediction on huge corpora. I use them a fair amount. It does take some experimentation to find input formats that work, but the base models are way smarter and don't have any refusals. Honestly it's a bit weird: everyone complains about RLHF and so on, yet the non-instruct models are right there if you look for them. From the Discord chats I've been in, it seems people are just spoiled; they use bad prompt formats and give up when it doesn't work on the first try the way an instruct model would.
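
The format trick, for anyone who hasn't tried it: frame the task as a document to complete, not an instruction to follow. A sketch with a small base checkpoint via transformers (gpt2 is just a convenient example):

    # Prompting a base model: give it the *shape* of a document to continue.
    # gpt2 here is only a small, convenient base checkpoint for illustration.
    from transformers import pipeline

    generate = pipeline("text-generation", model="gpt2")

    # An instruct-style prompt like "List three names for a coffee shop."
    # tends to derail on a base model. The same task as a completion:
    prompt = "Coffee shop name ideas:\n1. The Daily Grind\n2. Brewed Awakening\n3."

    out = generate(prompt, max_new_tokens=20, do_sample=True, temperature=0.8)
    print(out[0]["generated_text"])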

Depends on a ton of stuff, really: the size of the model, how long you want to train it for, and what exactly you mean by "like Hacker News or Wikipedia". Both Wikipedia and Hacker News are pretty small by current LLM training-set standards, so if you trained only on a combination of those two, you would likely end up with a model lacking most of the capabilities we associate with large language models nowadays.

There are some that exist. The problem is you need at least some RLHF to make it follow instructions instead of just predicting sentences.

  • Instruction following is not the only way to interact with an LLM. Tuning LLMs to the assistant persona makes them much less useful for a lot of tasks, like naming things or generating prose.

If you used all of Wikipedia and HN, you could easily train a model for ~$200 worth of GPU time. The model really shouldn't be bigger than a few hundred million parameters for that quantity of data.
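
Back-of-envelope on that, with my own rough token counts and the published Chinchilla heuristic of ~20 training tokens per parameter:

    # Token counts are ballpark guesses, not measurements.
    wiki_tokens = 4e9    # English Wikipedia, order of magnitude
    hn_tokens   = 2e9    # all of HN, order of magnitude
    tokens = wiki_tokens + hn_tokens

    params = tokens / 20            # Chinchilla-optimal parameter count
    flops  = 6 * params * tokens    # standard ~6*N*D training-FLOPs estimate

    print(f"~{params/1e6:.0f}M params, ~{flops:.1e} training FLOPs")
    # -> ~300M params, ~1.1e+19 FLOPs. One modern GPU sustaining a few
    #    1e14 FLOP/s chews through that in a day or two, which lines up
    #    with a ~$200 cloud bill.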