Comment by Const-me

13 hours ago

> Intrinsics are faster, but you'll need several Newton-Raphson iterations for precision

I wonder, have you tried non-approximated intrinsics like _mm_div_ps( mul, _mm_sqrt_ps( div2 ) )?

The reason the standard library is so slow is exception handling and other edge cases. On modern CPUs the normal, non-approximated FP division and square root instructions aren’t terribly slow; e.g. on my computer the FP32 square root instruction has 15 cycles of latency and 0.5 cycles throughput.
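To make the comparison concrete, here is a minimal sketch of the two variants side by side, assuming an x86 target with SSE. The function names ratio_exact and ratio_approx are hypothetical (only mul and div2 come from the expression above), and the single Newton-Raphson step is just one common refinement, not necessarily what the parent used.

```c
#include <xmmintrin.h>  /* SSE: _mm_div_ps, _mm_sqrt_ps, _mm_rsqrt_ps */

/* Exact path: mul / sqrt(div2) on four floats at once.
   Both instructions are correctly rounded. */
static inline __m128 ratio_exact(__m128 mul, __m128 div2)
{
    return _mm_div_ps(mul, _mm_sqrt_ps(div2));
}

/* Approximate path: hardware reciprocal square root estimate
   (roughly 12 bits of precision) refined by one Newton-Raphson step:
   y' = 0.5 * y * (3 - x * y * y), where x = div2. */
static inline __m128 ratio_approx(__m128 mul, __m128 div2)
{
    const __m128 half  = _mm_set1_ps(0.5f);
    const __m128 three = _mm_set1_ps(3.0f);

    __m128 y  = _mm_rsqrt_ps(div2);                       /* initial estimate */
    __m128 yy = _mm_mul_ps(y, y);
    __m128 t  = _mm_sub_ps(three, _mm_mul_ps(div2, yy));  /* 3 - x*y*y */
    y = _mm_mul_ps(_mm_mul_ps(half, y), t);               /* refined 1/sqrt(x) */

    return _mm_mul_ps(mul, y);                            /* mul * 1/sqrt(div2) */
}
```

The approximate version needs five multiply/subtract instructions on top of the estimate and still isn't correctly rounded, so it's worth benchmarking both on the actual target before assuming the approximation wins.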


adgjlsfhk1 12 hours ago

Yeah, you generally can't approximate sqrt faster than just computing it. sqrt is generally roughly as fast as division.