Comment by QuadmasterXLII

13 hours ago

I’m really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly this task is so common that I would half expect compilers to have a hard coded special case specifically for fast dot products

Edit: Yeah, when I compile the “plain c++” with clang the main loop is 8 vmovups, 16 vfmadd231ps, and an add cmp jne. OP forgot some flags.