Discussion about this post

Al Martin

Fprox, as usual, a great article. A few comments:

- In the section about floating-point reduction sums, you answered the question "what's the difference between the ordered and unordered versions?", but you didn't address why you would want to use the slower (ordered) version. As your mini-benchmarks demonstrate, the unordered version is faster (although, surprisingly, not that much faster when there are a lot of elements). You and I both know, but a casual reader might not, that because every intermediate result is rounded, floating-point addition isn't associative. That means the ordered version will give consistent results across implementations and vector lengths, while the unordered version may not. (It's also worth noting that integer sum reductions do not round and are associative, but you do have to be careful about overflow.)

- In your figure illustrating the unordered reduction, you should "gray out" the arrows from the masked elements and the tail elements.

- In the "Reduction instruction latencies as a function of vl" plot, how do you explain the outliers (vredsum with vl=13, and vfredosum with vl=19)?
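
For the casual reader, here is a minimal sketch of the non-associativity point from the first comment. It is plain Python on double-precision floats, not RVV code, and the helper names `ordered_sum` and `pairwise_sum` are hypothetical. The pairwise tree is only one ordering that an unordered reduction might plausibly use; the order a given implementation actually picks is an assumption here.

```python
# Illustration only: hypothetical helpers, not RVV intrinsics.

def ordered_sum(values):
    """Add strictly left to right, the way an ordered reduction does."""
    acc = values[0]
    for v in values[1:]:
        acc = acc + v
    return acc


def pairwise_sum(values):
    """Add in a binary tree: one possible (assumed) unordered ordering."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# Near 1e16 the spacing between adjacent doubles is 2, so adding 1.0 is lost
# to rounding.
floats = [1e16, 1.0, -1e16, 1.0]
print(ordered_sum(floats))   # 1.0  (((1e16 + 1) - 1e16) + 1)
print(pairwise_sum(floats))  # 0.0  ((1e16 + 1) + (-1e16 + 1))

# Integer addition does not round, so both orderings agree. Python ints never
# overflow; fixed-width hardware integers wrap instead.
ints = [10**16, 1, -10**16, 1]
print(ordered_sum(ints), pairwise_sum(ints))  # 2 2
```

The same four inputs give 1.0 or 0.0 depending only on the grouping, which is why the ordered version is the one to reach for when results must match across implementations and vector lengths.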
