Comment by Const-me

13 hours ago

> Intrinsics are faster, but you'll need several Newton-Raphson iterations for precision

I wonder whether you have tried non-approximated intrinsics like _mm_div_ps( mul, _mm_sqrt_ps( div2 ) )?

The reason the standard library is so slow is exception handling and other edge cases. On modern CPUs the normal non-approximated FP division and square root instructions aren’t terribly slow; e.g. on my computer the FP32 square root instruction has 15 cycles of latency and 0.5 cycles throughput.
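To make the comparison concrete, here is a rough sketch of the two variants side by side. It assumes an x86 target with SSE, and it assumes mul is a numerator and div2 a positive radicand, since the original code isn't shown in this thread. The function names are made up for illustration, and the approximate version uses a single Newton-Raphson step rather than however many the original used.

```cpp
#include <immintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Exact variant: full-precision sqrt followed by full-precision division.
// `mul` and `div2` mirror the names in the comment; their real meaning in
// the original code is an assumption here.
static inline __m128 div_by_sqrt_exact(__m128 mul, __m128 div2) {
    return _mm_div_ps(mul, _mm_sqrt_ps(div2));
}

// Approximate variant: hardware reciprocal-sqrt estimate (about 12 bits)
// refined with one Newton-Raphson step, then multiplied by the numerator.
static inline __m128 div_by_sqrt_approx(__m128 mul, __m128 div2) {
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);

    __m128 y = _mm_rsqrt_ps(div2);  // y ~ 1 / sqrt(div2)

    // Newton-Raphson refinement: y = y * (1.5 - 0.5 * div2 * y * y)
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(div2, y), y);
    y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));

    return _mm_mul_ps(mul, y);
}
```

Which one wins depends on the CPU and on whether the loop is bound by latency or throughput, so both are worth benchmarking on the actual target before picking one.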

Yeah, you generally can't approximate sqrt faster than you can compute it exactly. sqrt is roughly as fast as division.