Comment by og_kalu

8 months ago

Well, "without explicit search" would probably be more accurate.

They do note that in the paper, though:

>Since transformers may learn to roll out iterative computation (which arises in search) across layers, deeper networks may hold the potential for deeper unrolls.

We don’t know if it’s using implicit search either. While it would be interesting if the network was doing some internal search, it’s also possible it has just memorized the evaluations from 10M games and is performing some function of the similarity of the input to those previously seen.
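
To make that hypothesis concrete: a caricature of "memorize and interpolate" is nearest-neighbor regression over encoded positions. This is only a sketch; `stored_X` and `stored_y` are hypothetical stand-ins for encoded training positions and their evaluations.

```python
import numpy as np

# Caricature of "some function of the similarity of the input to those
# previously seen": a similarity-weighted average of stored evaluations.
def knn_eval(x, stored_X, stored_y, k=16):
    dists = np.linalg.norm(stored_X - x, axis=1)   # distance to every seen position
    nearest = np.argsort(dists)[:k]                # k most similar positions
    weights = 1.0 / (dists[nearest] + 1e-8)        # closer -> more weight
    return np.average(stored_y[nearest], weights=weights)
```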

  • Maybe they could test it on "tricky" positions where increasing the search depth on Stockfish dramatically changes the evaluation.
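
    A minimal sketch of such a probe, using python-chess with a local Stockfish binary (the path here is an assumption), could look like:

    ```python
    import chess
    import chess.engine

    STOCKFISH_PATH = "stockfish"  # assumed path to a local Stockfish binary

    def eval_at_depth(engine, board, depth):
        """Centipawn score from White's perspective at a fixed search depth."""
        info = engine.analyse(board, chess.engine.Limit(depth=depth))
        return info["score"].white().score(mate_score=10_000)

    def is_tricky(board, shallow=4, deep=20, threshold=200):
        """Flag positions where deepening the search swings the eval
        by more than `threshold` centipawns."""
        with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as engine:
            swing = abs(eval_at_depth(engine, board, deep)
                        - eval_at_depth(engine, board, shallow))
        return swing > threshold
    ```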

  • Even if it's "implicit", I'm not sure that matters much. The point is that the model doesn't explicitly search anything; it just applies the learned transformation. If the weights of the learned transformation encode a sort of precomputed search and interpolation over the dataset, from an algorithmic perspective this still isn't search: it doesn't enumerate board states or state-action transitions (see the sketch at the end of this comment).

    >performing some function of the similarity of the input to those previously seen.

    This is indeed what transformers do. But obviously it learns some sort of interpolation/extrapolation which lets it do well on board states/games outside the training set.
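
    To make the "no enumeration" point concrete, here is explicit search next to a single learned function application. This is a toy sketch (material-count leaf evaluation), not the paper's setup:

    ```python
    import chess

    PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

    def material(board):
        """Toy leaf evaluation: material balance for the side to move."""
        score = 0
        for piece in board.piece_map().values():
            value = PIECE_VALUES[piece.piece_type]
            score += value if piece.color == board.turn else -value
        return score

    def negamax(board, depth):
        """Explicit search: enumerates legal moves and recurses over
        successor board states."""
        if depth == 0 or board.is_game_over():
            return material(board)
        best = -float("inf")
        for move in board.legal_moves:      # state-action enumeration
            board.push(move)
            best = max(best, -negamax(board, depth - 1))
            board.pop()
        return best

    # A "search-free" model is instead one function application:
    #   score = f_theta(board)
    # with no enumeration of moves or successor states at inference time,
    # even if f_theta's weights encode something search-like internally.
    ```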

    • I agree, from a practical perspective what matters is that it is distilling Stockfish’s search to a certain extent, which could yield computational efficiencies. “Without search” just means we’re not doing anything like minimax or MCTS.
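
      A hedged sketch of what that distillation objective could look like, with a hypothetical flat board encoding and architecture (not the paper's exact setup):

      ```python
      import torch
      import torch.nn as nn

      # Hypothetical distillation setup: regress from a flat board encoding
      # to Stockfish's evaluation, so the net absorbs the *results* of
      # search without searching at inference time.
      model = nn.Sequential(
          nn.Linear(768, 512),   # 768 = 12 piece planes x 64 squares
          nn.ReLU(),
          nn.Linear(512, 1),     # predicted centipawn score
      )
      optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
      loss_fn = nn.MSELoss()

      def train_step(board_tensor, stockfish_eval):
          """One supervised step: board encoding -> searched evaluation."""
          optimizer.zero_grad()
          loss = loss_fn(model(board_tensor), stockfish_eval)
          loss.backward()
          optimizer.step()
          return loss.item()
      ```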

  • If the Transformer was 'just' memorizing, you would expect width scaling to work much better than depth scaling (because width enables memorization much more efficiently), and you wouldn't expect depth to run into problems, because memorization isn't that complex. The fact that depth does run into problems suggests it's learning some more complicated algorithm that has issues with vanishing gradients and learning multiple serial steps, and the obvious complicated algorithm to be learning in this context would be an implicit search akin to the MuZero RNN (which, incidentally, doesn't need any symbolic solver like Stockfish to learn superhuman chess from scratch by self-play).

  • >We don’t know if it’s using implicit search either.

    Sure

    >it’s also possible it has just memorized the evaluations from 10M games and is performing some function of the similarity of the input to those previously seen.

    That's not possible. The space of possible chess games is incredibly large, and it is incredibly easy to play a game that has diverged from training. A model that had just memorized all the evaluations would break within ten or so moves at most (rough numbers below), let alone withstand robust evaluation.

    However exactly this model works, and how much or how little it relies on search, is unknown, but it is no doubt a model of the world of chess. https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
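
    The back-of-the-envelope arithmetic behind "ten or so moves" (the branching factor is a rough assumption):

    ```python
    # With a typical branching factor of ~35 legal moves, the number of
    # distinct 10-ply continuations dwarfs any feasible training set.
    branching_factor = 35
    plies = 10
    continuations = branching_factor ** plies   # ~2.8e15 lines
    training_games = 10_000_000                 # the paper's 10M games
    print(f"{continuations:.1e} possible 10-ply lines vs {training_games:.0e} games")
    ```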

    • If it could reliably win a mate-in-N position without inexplicably blundering, I would be more inclined to buy your search hypothesis. But it can't, which is one of the reasons the authors gave for finishing games with Stockfish. So whatever it's doing is clearly lossy, which an actual search would not be.

      Neural nets memorize all sorts of things. They memorize ad clicks in high dimensional state spaces. Transformers trained on the whole internet can often reproduce entire texts. It’s lossy, but it’s still memorizing.

      That seems like the simplest explanation for what’s happening here. There’s some sort of lossy memorization, not a search. The fact that the thing it has memorized is the result of a search doesn’t matter.
