Comment by mkristiansen

15 hours ago

This is really interesting. I have a couple of questions, mainly stemming from the fact that the C++ code is about 2x slower than the NumPy version.

I had a look at the assembly generated, both in your repo, and from https://godbolt.org/z/76K1eacsG

If you look at the assembly generated:

        vmovups ymm3, ymmword ptr [rdi + 4*rcx]
        vmovups ymm4, ymmword ptr [rsi + 4*rcx]
        add     rcx, 8
        vfmadd231ps     ymm2, ymm3, ymm4
        vfmadd231ps     ymm1, ymm3, ymm3
        vfmadd231ps     ymm0, ymm4, ymm4
        cmp     rcx, rax
        jb      .LBB0_10
        jmp     .LBB0_2

you are only using 5 of the AVX registers (ymm0 -- ymm4) before creating a dependency on one of the accumulators (ymm0 -- ymm2) being used for the results.

I wonder if widening your step size to process more than one 256-bit register per iteration might get you the speedup. Something like this (https://godbolt.org/z/GKExaoqcf) would get more of the AVX registers in your CPU doing work:

        vmovups ymm6, ymmword ptr [rdi + 4*rcx]
        vmovups ymm8, ymmword ptr [rsi + 4*rcx]
        vmovups ymm7, ymmword ptr [rdi + 4*rcx + 32]
        vmovups ymm9, ymmword ptr [rsi + 4*rcx + 32]
        add     rcx, 16
        vfmadd231ps     ymm5, ymm6, ymm8
        vfmadd231ps     ymm4, ymm7, ymm9
        vfmadd231ps     ymm3, ymm6, ymm6
        vfmadd231ps     ymm2, ymm7, ymm7
        vfmadd231ps     ymm1, ymm8, ymm8
        vfmadd231ps     ymm0, ymm9, ymm9
        cmp     rcx, rax
        jb      .LBB0_10
        jmp     .LBB0_2

which ends up using 10 of the registers, allowing for 6 fused multiply-adds, rather than 3, before creating a dependency on a previous result -- you might be able to unroll even further and make the chains longer.

Again -- this was a really interesting writeup :)