4 Comments
Jan 10 · Liked by Fprox

> if avl is greater than VLMAX then VLMAX is returned

That'd be logical, right? Too bad the RVV spec likes to throw curveballs at unsuspecting developers.

https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#constraints-on-setting-vl
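For reference, the constraints in that section of the v1.0 spec can be sketched as a small predicate. This is an illustrative check, not code from the post or the spec; the function name `vl_is_permitted` is made up for this example. It encodes the three rules: `vl = AVL` when `AVL <= VLMAX`; `ceil(AVL / 2) <= vl <= VLMAX` when `AVL < 2 * VLMAX`; and `vl = VLMAX` when `AVL >= 2 * VLMAX`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: returns true if `vl` is a value that the RVV 1.0
 * "Constraints on Setting vl" rules allow vsetvl{i} to return for the
 * given application vector length (avl) and VLMAX. */
bool vl_is_permitted(size_t avl, size_t vlmax, size_t vl) {
    assert(vlmax > 0);

    /* Rule 1: the request fits, so vl must equal AVL exactly. */
    if (avl <= vlmax) {
        return vl == avl;
    }

    /* Rule 3: AVL >= 2 * VLMAX, so vl must be VLMAX.
     * Written as a subtraction to avoid overflow in 2 * vlmax. */
    if (avl - vlmax >= vlmax) {
        return vl == vlmax;
    }

    /* Rule 2: VLMAX < AVL < 2 * VLMAX. Any value in
     * [ceil(AVL / 2), VLMAX] is allowed, not only VLMAX. */
    size_t lower = avl / 2 + avl % 2;
    return vl >= lower && vl <= vlmax;
}
```

With `VLMAX = 8` and `AVL = 12`, for instance, an implementation may return 8 but may also return 6 or 7, which is the curveball: code must not assume the first strip processes exactly `VLMAX` elements.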

Jan 10 · edited Jan 10 · Author

You are right: this is only one of the allowed implementation behaviors. I need to correct my sentence.

author

Post updated on Jan 10th 2024 to fix this error.

Nov 23, 2023 · Liked by Fprox

I recommend https://dzaima.github.io/intrinsics-viewer/ as a reference for the intrinsics.

I ran the float sum benchmark with 10000 elements and rdcycle on a C920, here are the results:

scalar: 27000 cycles

LMUL=1: 10792 cycles

LMUL=2: 9337 cycles

LMUL=4: 8702 cycles

LMUL=8: 10553 cycles

You can see how LMUL>1 basically acts as loop unrolling, as the C920 has DLEN<=VLEN. LMUL=8 is presumably slower than LMUL=4 because the core can issue one 512-bit load and one 512-bit store in parallel, but with LMUL=8 it can't (or rather doesn't) interleave the loads and stores. I expect future implementations not to suffer from this problem.
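To make the loop-unrolling point concrete, here is a rough model of how many strip-mined loop iterations each LMUL setting needs. It is only a sketch: the helper names are hypothetical, and the `VLEN = 128` used in the example below is an assumption about the C920, not something measured here. It uses the spec definition `VLMAX = LMUL * VLEN / SEW` for integer LMUL and assumes every iteration except the last handles `VLMAX` elements.

```c
#include <assert.h>
#include <stddef.h>

/* VLMAX = LMUL * VLEN / SEW for integer LMUL values (1, 2, 4, 8). */
size_t vlmax_for(size_t vlen_bits, size_t sew_bits, size_t lmul) {
    assert(sew_bits > 0 && lmul > 0);
    return lmul * vlen_bits / sew_bits;
}

/* Number of strip-mined loop iterations needed to cover n elements,
 * i.e. ceil(n / VLMAX). Hypothetical helper for illustration only. */
size_t strip_mine_iterations(size_t n, size_t vlen_bits,
                             size_t sew_bits, size_t lmul) {
    size_t vlmax = vlmax_for(vlen_bits, sew_bits, lmul);
    assert(vlmax > 0);
    return (n + vlmax - 1) / vlmax;
}
```

Under the assumed `VLEN = 128` with 32-bit floats, 10000 elements take 2500 iterations at LMUL=1, 1250 at LMUL=2, 625 at LMUL=4, and 313 at LMUL=8. Each doubling of LMUL halves the loop overhead, much like unrolling by two, while the amount of data processed stays the same.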
