
Comment by ansk

12 hours ago

Key-based attention is not attributable to the Transformer paper. The first paper I can find where keys, queries, and values are distinct matrices is https://arxiv.org/abs/1703.03906, described at the end of section 2. The authors of the Transformer paper are very clear in how they describe their contribution to the attention formulation, writing "Dot-product attention is identical to our algorithm, except for the scaling factor". I think it's fair to say that multi-head attention is the paper's only substantial contribution to the design of attention mechanisms.
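For concreteness, here is a minimal NumPy sketch of the formulation being discussed: dot-product attention with distinct query/key/value projection matrices and the 1/sqrt(d_k) scaling factor, plus the multi-head combination. Names and shapes are mine for illustration, not taken from either paper.

```python
# Minimal sketch (assumed names/shapes, not the paper's code): scaled dot-product
# attention with distinct Q/K/V projections, plus the multi-head variant.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    # q, k: (seq, d_k); v: (seq, d_v). The 1/sqrt(d_k) term is the
    # "scaling factor" the Transformer authors cite as their change
    # to plain dot-product attention.
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (seq, d_model); w_q, w_k, w_v, w_o: (d_model, d_model).
    seq, d_model = x.shape
    d_head = d_model // num_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    heads = []
    for h in range(num_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        heads.append(scaled_dot_product_attention(q[:, sl], k[:, sl], v[:, sl]))
    # Concatenate the per-head outputs and mix them with an output projection.
    return np.concatenate(heads, axis=-1) @ w_o
```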

I think you're overestimating the degree to which this type of research is motivated by big-picture, top-down thinking. In reality, it's a bunch of empirically-driven, in-the-weeds experiments that guide a very local search in an intractably large search space. I can just about guarantee the process went something like this:

- The authors begin with an architecture similar to the current SOTA, which was a mix of recurrent layers and attention

- The authors realize that they can replace some of the recurrent layers with attention layers, and performance is equal or better. It's also way faster, so they try to replace as many recurrent layers as possible.

- They realize that if they remove all the recurrent layers, the model sucks. They're smart people and they quickly realize this is because the attention-only model is invariant to sequence order. They add positional encodings to compensate for this.

- They keep iterating on the architecture design, incorporating best practices from the computer vision community such as normalization and residual connections, resulting in the now-famous Transformer block (roughly sketched after this list).
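To make the endpoint of that iteration concrete, here is a rough sketch (my naming, not the authors') of the resulting block: sinusoidal positional encodings added to the input to restore order information, then attention and a feed-forward network, each wrapped in a residual connection and layer normalization. It reuses multi_head_attention from the sketch above and assumes an even d_model.

```python
# Rough sketch of a Transformer-style block, under assumed names and shapes.
def positional_encoding(seq, d_model):
    # Sinusoidal encodings: compensate for the permutation invariance of
    # an attention-only model by injecting position information.
    pos = np.arange(seq)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def transformer_block(x, params, num_heads=8):
    # params: dict of weight matrices (w_q, w_k, w_v, w_o, w_ff1, w_ff2).
    a = multi_head_attention(x, params["w_q"], params["w_k"],
                             params["w_v"], params["w_o"], num_heads)
    x = layer_norm(x + a)                                         # residual + norm around attention
    ff = np.maximum(0.0, x @ params["w_ff1"]) @ params["w_ff2"]   # ReLU feed-forward network
    return layer_norm(x + ff)                                     # residual + norm around FFN
```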

At no point is any stroke of genius required to get from the prior SOTA to the Transformer. It's the type of discovery that follows so naturally from an empirically-driven approach to research that it feels all but inevitable.