Comment by gwern

8 months ago

If the Transformer were 'just' memorizing, you would expect width scaling to work much better than depth scaling (because width enables memorization much more efficiently), and you also wouldn't expect depth to run into problems, because memorization isn't that complex. That depth *does* run into problems suggests it's learning some more complicated algorithm, one which has issues with vanishing gradients & learning multiple serial steps - and the obvious complicated algorithm to be learning in this context would be an implicit search akin to the MuZero RNN (which, incidentally, doesn't need any symbolic solver like Stockfish to learn superhuman chess from scratch by self-play).
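To make the width-vs-depth asymmetry concrete: raw parameter count is a rough proxy for memorization capacity, and in a standard Transformer block (attention projections plus a 4x MLP, ignoring embeddings, biases, and LayerNorm) widening grows parameters quadratically while deepening grows them only linearly - and only depth adds serial computation steps. A minimal sketch, with the specific dimensions chosen purely for illustration:

```python
# Approximate parameter count of a Transformer stack, ignoring
# embeddings, biases, and LayerNorm (all small corrections).
def transformer_params(d_model: int, n_layers: int) -> int:
    attn = 4 * d_model * d_model       # Q, K, V, and output projections
    mlp = 2 * d_model * (4 * d_model)  # up- and down-projection, 4x hidden dim
    return n_layers * (attn + mlp)

base = transformer_params(512, 8)
wide = transformer_params(1024, 8)   # double the width: ~4x the parameters
deep = transformer_params(512, 16)   # double the depth: ~2x the parameters
```

So if memorization (capacity) were the bottleneck, widening is the cheap lever; depth mainly buys more serial steps, which is what an iterative algorithm like search would need.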