
Comment by m3kw9

13 hours ago

How do you explain a horse's 2 legs becoming 4 legs when rotated, assuming they only drew 2 legs in the side view?

The second L in LLM stands for "language". Nothing you're describing has to do with language modeling.

They could be using transformers, sure. But plenty of transformer-based models are not LLMs.

  • They are probably looking for LGMs (Large Generative Models), which encompass vision and multi-modal models.

The model need only recognize from the shape that it is a horse, and it would know to extrapolate from there. It would presumably retain some text encoding as a residual of training, but it doesn't need to be fed text from the text-encoder side to know that. Think of the CLIP encoder used in Stable Diffusion.
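
To make that concrete, here's a minimal sketch assuming the openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers library; the file name and candidate labels are hypothetical. It shows a CLIP image encoder placing a side-view sketch near the concept "horse" purely from its shape, with no caption attached to the image:

```python
# Minimal sketch (assumptions: HF transformers CLIP API, hypothetical file name).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("side_view_sketch.png")  # hypothetical two-legged side view
candidates = ["a horse", "a dog", "a chair"]  # illustrative probe labels only

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity; softmax gives a rough
# "which concept is this shape" distribution. No caption accompanies the image.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidates, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

The text strings here are only probes used to read the image embedding back out of the shared space; the image encoder itself never sees them, which is the point about not needing text input to know it's looking at a horse.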