Discussion about this post

Al Martin

Fprox, interesting, as usual.

Question:

In your chart of K230 scalar latencies, you show fdiv.d with a latency of 8 cycles, which seems implausible given that fmul.d is 6 cycles and div is 25 cycles. It might be possible for a divide by zero or a divide by a power of 2, but not in general unless it's a completely unrolled Newton-Raphson implementation. (That would be a lot of gates for this CPU, and a bad tradeoff of gates vs. latency vs. how often the instruction is used.) Can you clarify this?

Tien Tu Vo

It's a great article and a very structured analysis.

As I'm new to the topic, I would like to better understand whether there is a latency difference between 32-bit and 64-bit floating point (for example, fmul.s vs fmul.d).

Did you benchmark the load and store instructions for 32-bit and 64-bit floating point?
