Comment by ashvardanian

12 hours ago

The conclusion of the article isn't entirely accurate.

> Why? Because they are calculated using the BLAS library available in the OS, which means that not even writing C++ SIMD code will make me have a faster implementation than the one Python is using and I will probably have to write my own assembly code with compiler-like tricks to go as fast as Python plus C++ libraries.

Switching from C++ SIMD intrinsics to assembly won't yield significant performance improvements for cosine distance. The kernel is typically too small for reordering tricks to have much impact. In fact, you can already outperform both NumPy and SciPy with optimized intrinsics, as I discussed in a previous HN post (https://github.com/ashvardanian/simsimd?tab=readme-ov-file#c...). There's also a promising RFC in SciPy to allow pluggable backends, which could make integrating such kernels easier (https://github.com/scipy/scipy/issues/21645#issuecomment-239...).

For many-to-many distance calculations on low-dimensional vectors, the bottleneck is often in the reciprocal square roots. Using LibC is slow but highly accurate. Intrinsics are faster, but you'll need several Newton-Raphson iterations for precision, and the number of iterations varies between x86 and Arm architectures (https://github.com/ashvardanian/simsimd?tab=readme-ov-file#c...). When dealing with high-dimensional vectors, you start competing with BLAS implementations, though they, too, have limitations on newer hardware or in mixed precision scenarios.

2 comments

ashvardanian

Const-me 11 hours ago

> Intrinsics are faster, but you'll need several Newton-Raphson iterations for precision

I wonder have you tried non-approximated intrinsics like _mm_div_ps( mul, _mm_sqrt_ps( div2 ) ) ?

The reason standard library is so slow is exception handling and other edge cases. On modern CPUs normal non-approximated FP division and square root instructions aren’t terribly slow, e.g. on my computer FP32 square root instruction has 15 cycles latency and 0.5 cycles throughput.

adgjlsfhk1 10 hours ago

yeah you generally can't approximate sqrt faster than computing it. sqrt is generally roughly as fast as division.