Comment by Const-me

13 hours ago

> which means that not even writing C++ SIMD code will make me have a faster implementation than the one Python is using and I will probably have to write my own assembly code

I believe assembly is almost always the wrong choice in the modern world. It’s just that your SIMD version is not very efficient.

Your original SIMD version completes in 0.065 ms on my computer. Here’s an optimized version which completes in 0.038 ms, i.e. 1.7x faster: https://gist.github.com/Const-me/41b013229b20f920bcee22a856c... Note that I used 4 sets of accumulators to work around the relatively high latency of the FMA instructions. With a single accumulator every FMA has to wait for the previous one to finish; independent accumulators let several be in flight at once.
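To illustrate the multiple-accumulator idea, here is a minimal sketch. It is not the code from the gist: it assumes the kernel is a dot product of two float arrays, and `dotProduct4Acc` is a hypothetical name. When AVX and FMA are enabled at compile time it uses four vector accumulators; otherwise it falls back to four scalar accumulators, which shows the same dependency-breaking pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from the gist): dot product with 4 independent accumulators.
float dotProduct4Acc(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const float* pa = a.data();
    const float* pb = b.data();
    const std::size_t n = a.size();
    std::size_t i = 0;
    float result;
#if defined(__AVX__) && defined(__FMA__)
    // Four independent dependency chains, 8 floats each, 32 floats per iteration.
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();
    for (; i + 32 <= n; i += 32)
    {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(pa + i), _mm256_loadu_ps(pb + i), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(pa + i + 8), _mm256_loadu_ps(pb + i + 8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(pa + i + 16), _mm256_loadu_ps(pb + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(pa + i + 24), _mm256_loadu_ps(pb + i + 24), acc3);
    }
    // Combine the four accumulators, then reduce the 8 lanes to a single float.
    const __m256 sum8 = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(sum8), _mm256_extractf128_ps(sum8, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    result = _mm_cvtss_f32(s);
#else
    // Scalar fallback: same pattern, four independent partial sums.
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    for (; i + 4 <= n; i += 4)
    {
        acc0 += pa[i] * pb[i];
        acc1 += pa[i + 1] * pb[i + 1];
        acc2 += pa[i + 2] * pb[i + 2];
        acc3 += pa[i + 3] * pb[i + 3];
    }
    result = (acc0 + acc1) + (acc2 + acc3);
#endif
    // Remaining elements that did not fill a whole unrolled iteration.
    for (; i < n; ++i)
        result += pa[i] * pb[i];
    return result;
}
```

With GCC or Clang the vector path is enabled by building with `-mavx2 -mfma`. Because the additions happen in a different order than in a single-accumulator loop, the result can differ from it in the last few bits.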

However, I’m pretty sure the implementation used by these Python libraries leverages multiple CPU cores under the hood. Here’s another C++ version which does that as well; it completed in 0.0136 ms on my computer, i.e. 4.8x faster than your original: https://gist.github.com/Const-me/c61e836bed08cef2f06783c7b11...
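As a rough sketch of what splitting the work across cores can look like (again not the code from the gist, and again assuming a dot product of float arrays), one option is an OpenMP reduction. `dotProductParallel` is a hypothetical name, and OpenMP itself is an assumption here; the gist may well use a different threading mechanism.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the gist): dot product split across cores with OpenMP.
float dotProductParallel(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    const float* pa = a.data();
    const float* pb = b.data();
    const std::int64_t n = static_cast<std::int64_t>(a.size());
    float sum = 0.0f;
    // Each thread accumulates a private partial sum over its share of the range;
    // OpenMP adds the partial sums together when the loop ends.
    // If OpenMP is not enabled, the pragma is ignored and the loop runs on one thread.
#pragma omp parallel for reduction(+ : sum)
    for (std::int64_t i = 0; i < n; ++i)
        sum += pa[i] * pb[i];
    return sum;
}
```

With GCC or Clang this needs `-fopenmp` to actually run in parallel. For inputs that take only a few microseconds, the cost of waking the worker threads can matter as much as the arithmetic, so any gain depends on the runtime keeping its threads ready between calls.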