Discussion about this post

-.-

> if avl is greater than VLMAX then VLMAX is returned

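That'd be logical, right? Too bad the RVV spec likes to throw curveballs at unsuspecting developers: when avl lies strictly between VLMAX and 2*VLMAX, an implementation may return any vl from ceil(avl/2) up to VLMAX. Only avl >= 2*VLMAX guarantees that VLMAX is returned.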

https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#constraints-on-setting-vl
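As a rough illustration of the constraints in that section, here is a minimal sketch of a predicate that says whether a given vl is a permitted result for some avl and VLMAX. The function name `vl_is_permitted` is made up for this example and is not part of the spec or of any intrinsics header.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the spec or any library): returns true if
 * `vl` is a value an implementation may return for the requested `avl`,
 * following the three cases in "Constraints on Setting vl" of RVV 1.0. */
bool vl_is_permitted(size_t avl, size_t vlmax, size_t vl)
{
    /* Case 1: avl <= VLMAX, so vl must be exactly avl. */
    if (avl <= vlmax)
        return vl == avl;

    /* Case 2: VLMAX < avl < 2*VLMAX, so any vl in
     * [ceil(avl/2), VLMAX] is allowed. */
    if (avl < 2 * vlmax) {
        size_t half = avl / 2 + avl % 2; /* ceil(avl / 2) */
        return vl >= half && vl <= vlmax;
    }

    /* Case 3: avl >= 2*VLMAX, so vl must be exactly VLMAX. */
    return vl == vlmax;
}
```

For example, with VLMAX = 8 and avl = 10, both vl = 8 and vl = 5 pass the check, which is why a strip-mined loop should always advance by the vl it actually got back rather than by VLMAX.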

camel-cdr

I recommend https://dzaima.github.io/intrinsics-viewer/ as a reference for the intrinsics.

I ran the float sum benchmark with 10000 elements on a C920, measuring with rdcycle. Here are the results:

scalar: 27000 cycles

LMUL=1: 10792 cycles

LMUL=2: 9337 cycles

LMUL=4: 8702 cycles

LMUL=8: 10553 cycles

You can see how LMUL>1 basically acts as loop unrolling, as the C920 has DLEN<=VLEN. The reason LMUL=8 is slower than LMUL=4 is, presumably, that the core can issue one 512-bit load and one 512-bit store in parallel, but with LMUL=8 it can't (or rather doesn't) interleave the loads and stores. I expect future implementations not to suffer from this problem.
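To make the unrolling effect concrete, here is a small portable C sketch that only counts iterations of a strip-mined loop. It assumes VLEN = 128 bits (consistent with a 512-bit access at LMUL=4) and 32-bit floats, and it models vsetvl in the simplest permitted way, as min(remaining, VLMAX). The helper names are made up for this example, and real hardware may pick a different vl near the tail.

```c
#include <stddef.h>

/* Hypothetical helper: VLMAX for a given VLEN and SEW (both in bits)
 * and an integer LMUL of 1, 2, 4 or 8. */
size_t vlmax_for(size_t vlen_bits, size_t sew_bits, size_t lmul)
{
    return vlen_bits / sew_bits * lmul;
}

/* Hypothetical helper: number of iterations a strip-mined loop needs to
 * cover n elements, assuming each iteration processes
 * min(remaining, vlmax) elements. */
size_t stripmine_iterations(size_t n, size_t vlmax)
{
    size_t iters = 0;

    if (vlmax == 0)
        return 0; /* guard against a loop that never makes progress */

    while (n > 0) {
        size_t vl = n < vlmax ? n : vlmax; /* simplified vsetvl model */
        n -= vl;
        iters++;
    }
    return iters;
}
```

Under those assumptions, 10000 floats take 2500 iterations at LMUL=1, 1250 at LMUL=2, 625 at LMUL=4 and 313 at LMUL=8. The per-iteration loop overhead shrinks the same way it would with manual unrolling, though the iteration count alone does not explain the LMUL=8 slowdown.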
