vzip.vv could previously be done with `vwmaccu.vx(vwaddu.vv(a, b), -1U, b)`, vunzip[e,o].vv is vnsrl.vi and vpaire/vpairo are masked vslide1up/vslide1down.
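To see why the widening sequence interleaves, here is a minimal scalar model in Python (not real intrinsics; the helper names are made up for illustration and only mimic the instructions element by element). The point is that `a + b + (2^SEW - 1) * b = a + (b << SEW)`, so each 2*SEW-wide result holds `a` in its low half and `b` in its high half, which is the zipped order when read back as SEW-wide elements in little-endian layout.

```python
def vwaddu_vv(a, b, sew):
    # Model of a widening unsigned add: results are 2*SEW bits wide.
    wide_mask = (1 << (2 * sew)) - 1
    return [(x + y) & wide_mask for x, y in zip(a, b)]


def vwmaccu_vx(acc, scalar, b, sew):
    # Model of a widening unsigned multiply-accumulate with a scalar:
    # acc[i] += scalar * b[i], truncated to 2*SEW bits.
    wide_mask = (1 << (2 * sew)) - 1
    scalar &= (1 << sew) - 1  # -1U becomes 2^SEW - 1
    return [(d + scalar * y) & wide_mask for d, y in zip(acc, b)]


def split_wide(wide, sew):
    # Reinterpret each 2*SEW element as two SEW elements (low half first).
    narrow_mask = (1 << sew) - 1
    out = []
    for w in wide:
        out.append(w & narrow_mask)
        out.append(w >> sew)
    return out


def zip_via_wmacc(a, b, sew=8):
    # Hypothetical helper: the sequence from the comment above.
    wide = vwmaccu_vx(vwaddu_vv(a, b, sew), -1, b, sew)
    return split_wide(wide, sew)
```

For example, `zip_via_wmacc([1, 2, 3], [10, 20, 30])` gives `[1, 10, 2, 20, 3, 30]`. The sum never overflows 2*SEW bits, since the largest value is `(2^SEW - 1) + 2^SEW * (2^SEW - 1) = 2^(2*SEW) - 1`.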
I am pretty sure everyone would be happier using vzip rather than the widening macc sequence :-) (and I am pretty sure it will be much more energy efficient; toggling a multiplier is getting expensive these days).
With respect to the masked slides, why vslide1up/vslide1down and not masked vslideup.vi/vslidedown.vi? (Here the benefit of the new instructions would be similar to implementing the unzip with a vcompress, I guess: fewer instructions, less need to materialize a mask, and hopefully a better-suited datapath when hardware is not an issue.)
Yeah, that wasn't a critique of the instructions. I just wanted to point out how to do the equivalent thing today.
vzip.vv is a great improvement over the widening macc sequence. vunzip doesn't give much of an advantage over vnsrl.vi, but it does also work for the SEW=64 case.
The vpair[e,o] instructions are also great, because they are extremely cheap to implement and will likely have higher throughput and lower latency than the vslide instructions. You also don't need to move the masks around so much for the common transpose usage.
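A rough scalar model of the narrowing-shift trick, again in Python with made-up helper names: view each pair of SEW-wide elements as one 2*SEW-wide element, then shift right by 0 or by SEW and keep the low SEW bits. This also hints at the SEW=64 limitation, since the source elements would have to be 128 bits wide, which is an assumption about implementations with ELEN=64.

```python
def as_wide(elems, sew):
    # View consecutive pairs of SEW-bit elements as one 2*SEW-bit element
    # (little-endian: the even element lands in the low half).
    return [elems[i] | (elems[i + 1] << sew) for i in range(0, len(elems), 2)]


def narrowing_shift_right(wide, shift, sew):
    # Model of a narrowing logical right shift: shift the 2*SEW source
    # element, then keep only the low SEW bits.
    narrow_mask = (1 << sew) - 1
    return [(w >> shift) & narrow_mask for w in wide]


def unzip_even(elems, sew=8):
    # Hypothetical helper: shift by 0 selects the even-indexed elements.
    return narrowing_shift_right(as_wide(elems, sew), 0, sew)


def unzip_odd(elems, sew=8):
    # Hypothetical helper: shift by SEW selects the odd-indexed elements.
    return narrowing_shift_right(as_wide(elems, sew), sew, sew)
```

With `elems = [1, 10, 2, 20, 3, 30]`, `unzip_even(elems)` gives `[1, 2, 3]` and `unzip_odd(elems)` gives `[10, 20, 30]`.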
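For the masked-slide equivalence, here is a small Python sketch. It assumes the pairing semantics are the transpose-style ones (even elements of both sources side by side, and likewise for odd elements); that reading is inferred from the discussion, not taken from the specification, and all helper names are hypothetical.

```python
def masked_slide1up(dest, src, scalar, mask):
    # Model of a masked slide-up by one: active element i takes src[i-1],
    # element 0 takes the scalar; inactive elements keep dest.
    out = list(dest)
    for i in range(len(dest)):
        if mask[i]:
            out[i] = scalar if i == 0 else src[i - 1]
    return out


def masked_slide1down(dest, src, scalar, mask):
    # Model of a masked slide-down by one: active element i takes src[i+1],
    # the last element takes the scalar; inactive elements keep dest.
    n = len(dest)
    out = list(dest)
    for i in range(n):
        if mask[i]:
            out[i] = scalar if i == n - 1 else src[i + 1]
    return out


def pair_even(a, b):
    # Assumed semantics: out[2i] = a[2i], out[2i+1] = b[2i].
    odd_mask = [i % 2 == 1 for i in range(len(a))]
    return masked_slide1up(a, b, 0, odd_mask)


def pair_odd(a, b):
    # Assumed semantics: out[2i] = a[2i+1], out[2i+1] = b[2i+1].
    even_mask = [i % 2 == 0 for i in range(len(a))]
    return masked_slide1down(b, a, 0, even_mask)
```

With `a = [0, 1, 2, 3]` and `b = [10, 11, 12, 13]`, `pair_even(a, b)` gives `[0, 10, 2, 12]` and `pair_odd(a, b)` gives `[1, 11, 3, 13]`, i.e. each 2x2 block is transposed. Note that both paths need an alternating mask to be materialized first, which is the overhead mentioned above.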
BTW, vdot4au.vv also sounds interesting for number parsing.
> Yeah, that wasn't a critique of the instructions.
I figured :-). IIRC you were the direct driver behind some of those new instructions.
https://gist.github.com/camel-cdr/99a41367d6529f390d25e36ca3e4b626