12 Comments
Al Martin

Fprox, as usual, a great article. A few comments:

- In the section about Floating-Point reduction sums, you answered the question about "what's the difference between the ordered and unordered versions", but you didn't address why you would want to use the slower (ordered) version. As your mini-benchmarks demonstrate, the unordered version is faster (although, surprisingly, not that much faster when there are a lot of elements). You and I both know, but a casual reader might not, that due to quirks in the way rounding is done, floating-point arithmetic isn't associative, meaning that the ordered version will give consistent results across implementations and vector lengths, while the unordered version may not; a short example is below this list. (It's also worth noting that integer sum reductions do not round and are associative, but you do have to be careful about overflow.)

- in your figure illustrating the unordered reduction, you should "gray out" the arrows from the masked elements and the tail elements.

- in the "Reduction instruction latencies as a function of vl" plot, how do you explain the outliers (vredsum with vl=13, and vfredosum with vl=19)?

camel-cdr

I wonder if it's faster to use the unordered reductions and accumulate in a binary tree, if you can modify the source array.

That should still give you a reproducible result, and it should be possible to do it in a VLEN-agnostic way.

FPRox

Hey @camel-cdr, what do you mean, in particular, by "if you can modify the source array"?

camel-cdr

I mean that if you want to compute the sum of a float array, you are able to overwrite the values in it, because it is discarded afterwards anyway.

The "reason" you want a serial sum, is such that the result is always exactly the same on any VLEN and on scalar implementations.

This sacrifices a lot of performance.

You can get the same properties by doing a binary tree reduction over the entire array. In the worst case you have to do 2x more additions and loads, but you can trivially get the extra loads down to 1.25x or 1.175x. The additions can almost all be executed in parallel, so the 2x more additions are extremely cheap.

Incidentally, the floating-point error (difference from the infinite-precision sum) of this approach is also lower than that of the serial addition.
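
The simplest in-place version of that idea looks something like this (a rough sketch, not the exact code from the godbolt link below): the addition order is fixed by the array indices alone, so a scalar build and any VLEN produce the same bits.

```c
#include <stddef.h>

/* In-place binary-tree (pairwise) sum: each pass folds element
   i + stride into element i. The set and grouping of additions depend
   only on the indices, not on how the inner loop is vectorized. */
static float tree_sum_inplace(float *a, size_t n)
{
    if (n == 0)
        return 0.0f;
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];  /* iterations are independent: vectorizes for any VLEN */
    return a[0];
}
```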

There is another approach, which is using multiple accumulators. This is great if you don't need the same result on all platforms, but bad if you do, because you are forced to use the same accumulator count across platforms. E.g. the 64-element accumulator you would want for AVX-512 can't have a fast scalar implementation.
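
For illustration, a rough sketch with an arbitrary hard-coded count of 4 accumulators (the count is the part that would have to match across platforms to get identical results):

```c
#include <stddef.h>

/* Multiple-accumulator sum: fast because the 4 partial sums carry no
   dependency on each other, but the result depends on the accumulator
   count, so every platform must hard-code the same count. */
static float sum_4_accumulators(const float *a, size_t n)
{
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc[0] += a[i + 0];
        acc[1] += a[i + 1];
        acc[2] += a[i + 2];
        acc[3] += a[i + 3];
    }
    for (; i < n; i++)            /* leftover tail elements */
        acc[i % 4] += a[i];
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}
```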

https://godbolt.org/z/qsrP7zsnx

I did a few benchmarks, and the tree reduction is always faster than linear, within 10-30% of the optimal performance of the multiple-accumulators approach (which can only be reached with an architecture-specific accumulator count), and a lot more accurate than the other approaches.

The only problem is that you need to be able to modify the source array, although I think a hybrid approach is possible. I'll probably write an article about that soon.

camel-cdr

I suppose you could use recursion instead of overwriting the array. I haven't benchmarked that yet, but it seems less predictable for the CPU, and I like putting vector registers on the stack.

Al Martin

I don't think it's possible to rearrange the sources to be summed in the same order as the (sequential) ordered sum.

(a + b) + (c + d) != ((a + b) + c) + d

In other words, the grouping of the additions matters: floating-point addition is not associative, due to rounding.

camel-cdr

The idea is that you don't need a sequential sum for reproducibility, as long as you use the same fixed order (it doesn't have to be the sequential one) for both scalar and vector. See my answer to FPRox above.

Al Martin

Are you saying you can get reproducibility with a binary tree approach, but it's not necessarily the same result as with a sequential approach?

I'm having trouble visualizing how you would do that in a way that is independent of the vector length. I also don't see how this could be guaranteed across different implementations, unless the spec mandates how to do it.

There are a lot of algorithms that don't need accuracy down to the last ulp. For those you can use vfwredusum, followed by converting back down to the original format if desired. (Although, as Fprox pointed out, there's no 64-bit (double-precision) version of this.)

FPRox

Thank you Al.

> but you didn't address why you would want to use the slower (ordered) version.

You are right: I only cited the section of the spec about the use of the ordered version being a valid way to autovectorize a loop. I should add more details.

> (It's also worth noting that integer sum reductions do not round and are associative, but you do have to be careful about overflow.)

I actually mentioned that: "Both integer reduction sums wrap on overflow." and "For all the reduction operations listed above the order of evaluation does not matter as the elementary 2-operand operations are associative and commutative: all orders return the same result."

What do you mean by "but you do have to be careful about overflow"?

> - in your figure illustrating the unordered reduction, you should "gray out" the arrows from the masked elements and the tail elements.

That is a great idea for the masked elements: it would show that the same binary tree is used whether the elements are masked or not. I am a bit more circumspect about the tail, as the structure of the tree can actually change with vl, so I should maybe "gray out" the operator node as well.

> - in the "Reduction instruction latencies as a function of vl" plot, how do you explain the outliers (vredsum with vl=13, and vfredosum with vl=19)?

I did not look too much into that; I would imagine that the benchmark sometimes conflicts with a context switch trashing the caches, or something like that. I run the same instruction 30 times for each measurement, and I wanted to present the actual results I got. I should run the benchmark a few times and exclude the extreme values. I don't believe the outliers correspond to an actual slowdown for particular values of vl in the K230 micro-architecture.

Al Martin

> What do you mean by "but you do have to be careful about overflow"?

Just that: it's not obvious whether an integer overflow has occurred. RISC-V doesn't keep the traditional NZCV integer flags, but I'm not sure they would even be useful in a reduction anyway. Interestingly, though, an overflow isn't sticky with signed numbers: a negative number could bring you back into range. But you can't tell whether there was a wraparound without examining the elements.

With vlen = 64k bits, reductions with element widths of 16/32/64 bits produce 25/40/71-bit results (if reduced in a binary tree), so vwredsum is adequate and can't overflow. However, an element width of 8 bits can produce an 18-bit result, so it's possible for even vwredsum to overflow. This scenario would require restricting the maximum vlen to 16k bits to prevent overflow.
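
To spell out the general bound (a small helper added here for illustration, not from the article): the exact sum of n unsigned w-bit elements needs w + ceil(log2(n)) bits, e.g. 1024 elements of 8 bits need 18 bits.

```c
#include <stdint.h>

/* Bits needed to hold the exact sum of n unsigned w-bit elements
   (for 1 <= n <= 2^63): w + ceil(log2(n)), since the sum is below n * 2^w.
   E.g. 1024 elements of 8 bits -> 8 + 10 = 18 bits, which no longer
   fits in the 16-bit result element of a widening reduction. */
static unsigned sum_width_bits(unsigned w, uint64_t n)
{
    unsigned extra = 0;
    while ((UINT64_C(1) << extra) < n)  /* extra = ceil(log2(n)) */
        extra++;
    return w + extra;
}
```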

FPRox

I see. But there is no vwredsum for 64-bit elements (not yet anyway :-)). So I think that, for any VLEN, if you want to detect overflow for 64-bit elements you will need another way (possibly splitting the numbers and summing the parts independently).
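
A scalar sketch of that splitting idea (a hypothetical helper, relying on the GCC/Clang unsigned __int128 extension): sum the low and high 32-bit halves into separate 64-bit accumulators, then recombine in 128 bits to see whether the plain 64-bit sum would have wrapped.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if summing the n unsigned 64-bit elements wraps a 64-bit
   accumulator; *out receives the wrapped (mod 2^64) sum either way.
   Safe for n < 2^32, because each half-sum then stays below 2^64. */
static int u64_sum_overflows(const uint64_t *a, size_t n, uint64_t *out)
{
    uint64_t lo = 0, hi = 0;
    for (size_t i = 0; i < n; i++) {
        lo += a[i] & 0xffffffffu;  /* low 32-bit halves  */
        hi += a[i] >> 32;          /* high 32-bit halves */
    }
    unsigned __int128 sum = ((unsigned __int128)hi << 32) + lo;
    *out = (uint64_t)sum;
    return sum > UINT64_MAX;
}
```

The two half accumulations are themselves ordinary wrap-free integer sums, so each could still be done with a plain vector reduction.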

Al Martin

I was toying with the idea of "vwredsumh(u)" (similar in concept to mulh(u)) to handle the SEW=8 case, but as you point out, it could also be useful for the SEW=64 case. (So while we're at it, we could also provide it for SEW=32 in the max_sew=32 case, and then similarly for SEW=16.)
