I ran the float sum benchmark with 10000 elements and rdcycle on a C920; here are the results:
scalar: 27000 cycles
LMUL=1: 10792 cycles
LMUL=2: 9337 cycles
LMUL=4: 8702 cycles
LMUL=8: 10553 cycles
You can see how LMUL>1 basically acts as loop unrolling, since the C920 has DLEN<=VLEN. The reason LMUL=8 is slower than LMUL=4 is, presumably, that the core can issue one 512-bit load and one 512-bit store in parallel, but with LMUL=8 it can't (or rather doesn't) interleave the loads and stores. I expect future implementations not to suffer from this problem.
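For anyone who wants to reproduce this, a float sum kernel at LMUL=4 could look roughly like the sketch below, written with the v1.0 C intrinsics. This is not the exact benchmark code; the function name and the tail-undisturbed accumulation strategy are just illustrative.

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: sum of n floats using LMUL=4 (not the exact benchmark source).
float vec_sum_f32m4(const float *x, size_t n) {
    size_t vlmax = __riscv_vsetvlmax_e32m4();
    // Vector accumulator, initialized to zero across all VLMAX lanes.
    vfloat32m4_t acc = __riscv_vfmv_v_f_f32m4(0.0f, vlmax);

    for (size_t vl; n > 0; n -= vl, x += vl) {
        vl = __riscv_vsetvl_e32m4(n);
        vfloat32m4_t v = __riscv_vle32_v_f32m4(x, vl);
        // Tail-undisturbed add: lanes past vl keep their previous partial sums.
        acc = __riscv_vfadd_vv_f32m4_tu(acc, acc, v, vl);
    }

    // Reduce the vector accumulator to a single scalar.
    vfloat32m1_t zero = __riscv_vfmv_v_f_f32m1(0.0f, 1);
    vfloat32m1_t red  = __riscv_vfredusum_vs_f32m4_f32m1(acc, zero, vlmax);
    return __riscv_vfmv_f_s_f32m1_f32(red);
}
```

Changing the `m4` suffixes (and the corresponding types) to `m1`, `m2`, or `m8` gives the other LMUL variants.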
> if avl is greater than VLMAX then VLMAX is returned
That'd be logical, right? Too bad the RVV spec likes to throw curveballs at unsuspecting developers.
https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#constraints-on-setting-vl
You're right, this is only one of the allowed implementation behaviors; I need to correct my sentence.
Post updated on Jan 10th 2024 to fix this error.
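In practice this means a strip-mining loop should always use the vl that vsetvl actually returns, rather than assuming vl == min(avl, VLMAX): for VLMAX < avl < 2*VLMAX, the spec permits any vl with ceil(avl/2) <= vl <= VLMAX. A minimal sketch of the portable pattern (the function itself is just for illustration):

```c
#include <riscv_vector.h>
#include <stddef.h>

// Sketch: strip-mining that relies only on what the spec guarantees.
// Do not assume vl == VLMAX or vl == avl; use the returned value everywhere.
void add_one(float *x, size_t n) {
    for (size_t vl; n > 0; n -= vl, x += vl) {
        vl = __riscv_vsetvl_e32m1(n);               // returned vl drives...
        vfloat32m1_t v = __riscv_vle32_v_f32m1(x, vl);
        v = __riscv_vfadd_vf_f32m1(v, 1.0f, vl);
        __riscv_vse32_v_f32m1(x, v, vl);            // ...loads, ops, and stores
    }
}
```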
I recommend https://dzaima.github.io/intrinsics-viewer/ as a reference for the intrinsics.