Comment by neonsunset

13 hours ago

> but unless you opt to implement a processor-specific calculation in C++

Not necessarily true if you use C# (or Swift or Mojo) instead:

    static float CosineSimilarity(
        ref float a,
        ref float b,
        nuint length
    ) {
        var sdot = Vector256<float>.Zero;
        var sa = Vector256<float>.Zero;
        var sb = Vector256<float>.Zero;

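        // Assumes length is a multiple of 8; otherwise the last load reads
        // past the end and a scalar tail loop would be needed.
        for (nuint i = 0; i < length; i += 8) {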
            var bufa = Vector256.LoadUnsafe(ref a, i);
            var bufb = Vector256.LoadUnsafe(ref b, i);

            sdot = Vector256.FusedMultiplyAdd(bufa, bufb, sdot);
            sa = Vector256.FusedMultiplyAdd(bufa, bufa, sa);
            sb = Vector256.FusedMultiplyAdd(bufb, bufb, sb);
        }

        var fdot = Vector256.Sum(sdot);
        var fanorm = Vector256.Sum(sa);
        var fbnorm = Vector256.Sum(sb);

        // Divide by the product of both norms (parentheses are required here).
        return fdot / (MathF.Sqrt(fanorm) * MathF.Sqrt(fbnorm));
    }

This compiles to good codegen: https://godbolt.org/z/hh16974Gd. On ARM64 it's correctly unrolled to 128x2 (two 128-bit operations per 256-bit vector).

Edit: as a sibling comment mentioned, this benefits from unrolling, which would require swapping 256 with 512 and += 8 with += 16 in the snippet above. Depending on your luck, though, Godbolt may run this on a CPU with AVX512, in which case you won't see the unrolling because it just picks the ZMM registers supported by the hardware instead :)
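
For reference, the 512-wide swap described in the edit could look roughly like the sketch below. It is not the code behind the Godbolt link: the class name `Similarity`, the method name `CosineSimilarity512`, and the scalar tail loop are my own additions for illustration, and it assumes .NET 9 or later (for `Vector512.FusedMultiplyAdd`).

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class Similarity
{
    // Hypothetical 512-bit variant of the snippet above (assumes .NET 9+).
    public static float CosineSimilarity512(ref float a, ref float b, nuint length)
    {
        var sdot = Vector512<float>.Zero;
        var sa = Vector512<float>.Zero;
        var sb = Vector512<float>.Zero;

        nuint width = (nuint)Vector512<float>.Count; // 16 floats per vector
        nuint i = 0;

        // Main loop: 16 floats per iteration, only while a full vector remains.
        for (; length - i >= width; i += width) {
            var bufa = Vector512.LoadUnsafe(ref a, i);
            var bufb = Vector512.LoadUnsafe(ref b, i);

            sdot = Vector512.FusedMultiplyAdd(bufa, bufb, sdot);
            sa = Vector512.FusedMultiplyAdd(bufa, bufa, sa);
            sb = Vector512.FusedMultiplyAdd(bufb, bufb, sb);
        }

        float fdot = Vector512.Sum(sdot);
        float fanorm = Vector512.Sum(sa);
        float fbnorm = Vector512.Sum(sb);

        // Scalar tail (my addition): handles lengths that are not a multiple of 16.
        for (; i < length; i++) {
            float x = Unsafe.Add(ref a, i);
            float y = Unsafe.Add(ref b, i);
            fdot += x * y;
            fanorm += x * x;
            fbnorm += y * y;
        }

        // Divide the dot product by the product of both norms.
        return fdot / (MathF.Sqrt(fanorm) * MathF.Sqrt(fbnorm));
    }
}
```

With two arrays `x` and `y` of equal length, this would be called as `Similarity.CosineSimilarity512(ref x[0], ref y[0], (nuint)x.Length)`. On hardware without AVX512 the JIT is expected to split each 512-bit operation into narrower ones, which is the unrolling effect mentioned above.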