Comment by QuadmasterXLII
15 hours ago
I’m really surprised by the performance of the plain C++ version. Is automatic vectorization turned off? Frankly, this task is so common that I would half expect compilers to have a hard-coded special case specifically for fast dot products.
Edit: Yeah, when I compile the “plain C++” version with clang, the main loop is 8 vmovups, 16 vfmadd231ps, and an add/cmp/jne. OP forgot some flags.
Which flags did you use, and which compiler version?
clang 19, -O3 -ffast-math -march=native
Can confirm, fast math makes the biggest difference.
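For anyone who wants to reproduce this, here is a minimal sketch of the kind of "plain C++" loop being discussed. The function name `dot` and its signature are my own assumptions; the OP's actual code may differ. Under strict IEEE semantics the compiler generally cannot reorder the additions into `sum`, so the reduction tends to stay scalar. `-ffast-math` permits that reassociation, which is what lets clang split the sum across vector lanes and emit the FMA instructions mentioned above.

```cpp
#include <cstddef>

// Hypothetical stand-in for the "plain C++" dot product from the article.
// Nothing here is vectorized by hand; any SIMD comes from the compiler.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        // A single running sum: without permission to reassociate
        // floating-point adds, each iteration depends on the previous one.
        sum += a[i] * b[i];
    }
    return sum;
}
```

Compiling with something like `clang++ -O3 -ffast-math -march=native -S dot.cpp` (the file name is just an example) and reading the generated assembly should show whether the loop was vectorized. Note that with fast math the summation order changes, so results can differ slightly from the strict version.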