(this article was previously published on my blog https://fprox.blogspot.com/2022/07/risc-v-vector-extension-in-nutshell.html)
Introduction
This article is the first of the series "RISC-V Vector Extension in a Nutshell", which presents the new vector extension for the RISC-V ISA. It covers some basic concepts of RVV (the RISC-V Vector extension).
RVV defines a new instruction set extension for the open RISC-V ISA (https://riscv.org/technical/specifications/).
This extension adds 32 vector registers, each VLEN bits wide. Contrary to SIMD extensions (e.g. ARM Neon or x86 SSE/AVX), VLEN is an implementation parameter that is not limited to one value: different implementations can have different VLEN values while being fully compliant with the RVV standard. Programs for RVV can be built according to vector length agnostic (VLA) principles, meaning that theoretically they can execute on implementations with different VLEN (within the range officially allowed by RVV).
Vector Length and element type
Most instructions in RVV are affected by global parameters: vl, LMUL, SEW.
vl is the vector length: it defines how many elements the next vector operations will process. The elements before the vl mark are part of the vector head or the vector body; the elements past vl are part of the tail.
SEW is the Selected Element Width, that is the elementary size (in bits) of an element in the input and output vectors.
LMUL is the group multiplier: it defines the size of the vector register group used for the input and output operands of a vector operation, that is, the number of vector registers forming the group. A vector register group is aligned on the group size, e.g. when LMUL=4, only the register indices v0, v4, v8, v12, v16, v20, v24, v28 are allowed, and v0 designates the group v0, v1, v2, v3. Fractional LMULs are legal, and constrain vl to only a fraction of a single vector register.
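The alignment rule for integer LMUL can be sketched in a few lines of Python. This is only a model of the rule described above, not RVV code; the function name register_group is hypothetical and fractional LMULs are deliberately left out.

```python
def register_group(base, lmul):
    """Return the register indices forming the group that starts at v<base>.

    Models the alignment rule for integer LMUL only (1, 2, 4 or 8).
    Raises ValueError when the base index is not a legal group start.
    """
    if lmul not in (1, 2, 4, 8):
        raise ValueError("this sketch only models integer LMUL: 1, 2, 4, 8")
    if not 0 <= base < 32:
        raise ValueError("RVV has 32 vector registers, v0 to v31")
    if base % lmul != 0:
        raise ValueError("v%d is not aligned on a group of %d" % (base, lmul))
    return list(range(base, base + lmul))


# Legal group bases for LMUL=4: every index that is a multiple of 4.
legal_bases_m4 = [r for r in range(32) if r % 4 == 0]
print(legal_bases_m4)
print(register_group(0, 4))
```

With LMUL=4 the model lists the same eight base registers as the text, and v0 expands to v0, v1, v2, v3.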
For example, if vl=4 and SEW=16 bits, a vfmadd.vv operation performs a half precision floating-point multiply and add on 4-element vectors, that is 4 FMAs.
vl is limited by VLMAX=LMUL * VLEN / SEW.
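As a quick check of the formula, here is a minimal Python sketch that evaluates VLMAX for a few configurations. The helper name vlmax is an assumption made for illustration; LMUL is passed as a Fraction so that fractional group multipliers are computed exactly.

```python
from fractions import Fraction


def vlmax(vlen_bits, sew_bits, lmul):
    """Return VLMAX = LMUL * VLEN / SEW as an integer number of elements."""
    return int(Fraction(lmul) * vlen_bits / sew_bits)


print(vlmax(512, 64, 1))               # one 512-bit register of 64-bit elements
print(vlmax(128, 32, 2))               # two 128-bit registers of 32-bit elements
print(vlmax(128, 16, Fraction(1, 2)))  # half of a 128-bit register, 16-bit elements
```

The same source code therefore sees a different VLMAX on implementations with different VLEN, which is why vector length agnostic programs query vl instead of hard-coding it.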
NOTE: some operations (e.g. widening operations) use different element widths for their inputs and outputs. For such operands EEW, the effective element width, may differ from SEW.
NOTE: there are other parameters, such as vstart. A comprehensive description of how a vector is split into segments (prestart, body, tail, active and inactive elements) can be found in the specification https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-inactive-defs.
Tail policies
As stated in the previous section, the elements past the vector length vl are considered part of the tail: elements for which the current operation computes no result.
RVV defines two tail policies: undisturbed and agnostic. In the undisturbed tail policy, the elements past vl in the destination register are left unmodified (that is, they keep the value they had before the operation). In the agnostic tail policy, the tail elements may either be left undisturbed or be filled with all 1s.
The tail can be used to apply the vector operations to an arbitrary number of elements.
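The following diagram presents an example with a VLEN=128 bits, LMUL=2 vector group (256 bits in total), interpreted either with vl=7 and SEW=32 bits (8 elements, the last one in the tail) or with vl=3 and SEW=64 bits (4 elements, the last one in the tail).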
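The two tail policies can be illustrated with a small Python model of the vl=7, SEW=32 bits case. It is a sketch, not RVV code: write_with_tail_policy and its fill_ones flag are hypothetical names, and the flag only selects which of the two outcomes allowed by the agnostic policy the model produces.

```python
def write_with_tail_policy(old_dest, computed, vl, sew_bits, policy, fill_ones=True):
    """Model the destination register group after an unmasked operation.

    policy is "tu" (tail undisturbed) or "ta" (tail agnostic).
    Under "ta" an implementation may keep the old value or write all 1s;
    fill_ones picks which of these two legal behaviours is modelled.
    """
    if policy not in ("tu", "ta"):
        raise ValueError("policy must be 'tu' or 'ta'")
    all_ones = (1 << sew_bits) - 1
    result = []
    for i, old in enumerate(old_dest):
        if i < vl:
            result.append(computed[i])   # body element: takes the new result
        elif policy == "ta" and fill_ones:
            result.append(all_ones)      # agnostic tail: overwritten with 1s
        else:
            result.append(old)           # tail left undisturbed
    return result


# VLEN=128, LMUL=2, SEW=32: the group holds 8 elements, vl=7 leaves 1 in the tail.
old = [100 + i for i in range(8)]
new = [i * i for i in range(8)]
print(write_with_tail_policy(old, new, 7, 32, "tu"))
print(write_with_tail_policy(old, new, 7, 32, "ta"))
```

Software that relies on the tail keeping its old value has to request the undisturbed policy, since the agnostic policy allows either outcome.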
Masking
Most RVV instructions accept an extra optional mask operand. This bitmask defines which result elements should actually be modified by the operation. Similarly to the tail behavior, RVV 1.0 defines two mask policies: mask undisturbed, where masked-off elements in the destination register keep the value they had before the operation, and mask agnostic, where masked-off elements can either be left undisturbed or be written with all 1s.
Part 3 of this blog series will survey operations with or on masks.
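The mask policies can be modelled in the same spirit with a masked integer addition. This is a Python sketch under stated assumptions: masked_vadd is a hypothetical helper, the mask is given as a list of 0/1 values rather than packed bits, and tail elements are simply left untouched.

```python
def masked_vadd(vd_old, vs2, vs1, mask, vl, sew_bits, policy, fill_ones=True):
    """Model a masked element-wise add on the first vl elements.

    policy is "mu" (mask undisturbed) or "ma" (mask agnostic).
    Under "ma" a masked-off element may keep its old value or become all 1s;
    fill_ones picks which of these two legal behaviours is modelled.
    """
    if policy not in ("mu", "ma"):
        raise ValueError("policy must be 'mu' or 'ma'")
    all_ones = (1 << sew_bits) - 1
    out = list(vd_old)                    # tail elements keep their old value here
    for i in range(vl):
        if mask[i]:
            out[i] = (vs2[i] + vs1[i]) & all_ones   # active element, wraps at SEW bits
        elif policy == "ma" and fill_ones:
            out[i] = all_ones                       # masked-off, agnostic: all 1s
        # masked-off, undisturbed: out[i] already holds the old value
    return out


vd = [9, 9, 9, 9]
print(masked_vadd(vd, [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], 4, 8, "mu"))
print(masked_vadd(vd, [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], 4, 8, "ma"))
```

Only the elements whose mask bit is set receive a sum; the policy decides what the other body elements may contain afterwards.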
Example
An assembly program in RVV might look like the following:
vsetvli x1, zero, e64, m1, ta, mu
vfmul.vf v15, v12, ft5
vfmul.vf v17, v12, ft6
Let us review this snippet in more detail.
vsetvli x1, zero, e64, m1, ta, mu
The first instruction is a vsetvli, which defines the SEW: e64 (64-bit elements), the group multiplier LMUL: m1 (a single register per group), the tail policy: ta (tail agnostic), and the mask policy: mu (mask undisturbed). It also requests a vector length, the AVL (application vector length), and gets back the actual vector length vl. In this specific case, the source register is zero (x0 in the integer register file): this means that the instruction requests the maximum vl value supported for this SEW/LMUL pair (VLMAX), and the actual resulting vl is stored into x1.
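The way vsetvli picks vl can be sketched with a small Python model. The function below is hypothetical and simplified: registers are passed as names, only integer LMUL is handled, and when an AVL is supplied the model returns min(AVL, VLMAX), which is the simplest legal choice (the specification leaves implementations some freedom when AVL lies between VLMAX and 2 * VLMAX).

```python
def vsetvli(rd, rs1, avl, vl_before, vlen_bits, sew_bits, lmul=1):
    """Return the new vl for a simplified model of vsetvli.

    rd, rs1   : register names, "zero" standing for x0
    avl       : value held in rs1 (ignored when rs1 is "zero")
    vl_before : vl value before the instruction executes
    """
    vlmax = (lmul * vlen_bits) // sew_bits
    if rs1 != "zero":
        return min(avl, vlmax)   # assumption: simplest legal choice of vl
    if rd != "zero":
        return vlmax             # rs1 is zero, rd is not: request VLMAX
    return vl_before             # both zero: vl is kept unchanged


# vsetvli x1, zero, e64, m1, ta, mu on a VLEN=512 implementation
print(vsetvli("x1", "zero", None, 0, 512, 64))
# vsetvli zero, zero, ... keeps the previous vl
print(vsetvli("zero", "zero", None, 5, 512, 64))
```

The last case is the one mentioned in the addendum at the end of this article: with both registers set to zero, only the element width and policies change.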
vfmul.vf v15, v12, ft5
vfmul.vf v17, v12, ft6
The two remaining instructions are vector floating-point multiplications between a vector and a scalar. The scalar is read from the scalar floating-point register file.
The behavior of the instructions is modified by the vl, LMUL and SEW parameters. Assuming we are executing on an implementation with VLEN=512 bits, the vsetvli will set vl to 8 (VLMAX = 1 * 512 / 64), and both vfmul.vf will perform a double precision (SEW=64) vector multiplication between an 8-element vector and a splat scalar, and store the 8-element result in a vector register.
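The effect of the two multiplications can be mimicked in Python, whose floats are double precision and therefore match SEW=64. This is only an illustrative model: the register contents and the scalar values chosen for ft5 and ft6 are made up, and vfmul_vf is a hypothetical helper, not an RVV intrinsic.

```python
VLEN_BITS = 512   # assumed implementation parameter
SEW_BITS = 64     # e64
LMUL = 1          # m1

# vsetvli x1, zero, e64, m1, ta, mu: vl is set to VLMAX
vl = LMUL * VLEN_BITS // SEW_BITS


def vfmul_vf(vs2, scalar, vl):
    """Multiply the first vl elements of vs2 by the same (splat) scalar."""
    return [vs2[i] * scalar for i in range(vl)]


v12 = [float(i) for i in range(vl)]   # made-up input vector
ft5, ft6 = 2.0, 0.5                   # made-up scalar values

v15 = vfmul_vf(v12, ft5, vl)          # vfmul.vf v15, v12, ft5
v17 = vfmul_vf(v12, ft6, vl)          # vfmul.vf v17, v12, ft6
print(vl, v15, v17)
```

Since vl equals VLMAX here, there is no tail and no mask is used, so neither the ta nor the mu setting changes the result in this example.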
Conclusion
The series continues:
Addendum: Many thanks to Hugues for pointing out an error in the initial version of this post (I used vsetvli zero, zero, ... which has the effect of keeping vl unchanged and only changing the element width and mask/tail policies).
References
- RVV 1.0 specification on GitHub
Thanks
Thank you to Ken P. for pointing out that external links got corrupted during import.