This article is part 2 of a series on the RISC-V vector extension.
Part 1 introduced the series with a general overview, Part 3 surveys operations with or on masks, Part 4 describes permute instructions and Part 5.1 / Part 5.2 cover memory operations.
In this second part we will review the basic families of RVV 1.0 arithmetic and logic instructions.
Most of those instructions (the most notable exception being the reductions) operate element-wise, as illustrated by the figure below. For a masked operation, only the body elements whose mask bit is set are modified.
Masked-off and tail elements follow the mask and tail policies (parameters defined in the vtype CSR).
The behavior of those operations is tuned by the global parameters vstart, vl, LMUL, and SEW (vstart and vl are CSRs of their own, while LMUL and SEW are subfields of the vtype CSR).
vstart specifies the index of the first active vector element, vl specifies the number of elements in the LMUL-wide vector group that the operation may affect (elements at index vl and above form the tail), and SEW specifies the width in bits, and hence the format, of the operand elements.
Optionally those operations can be masked: a mask is read from register v0 and the operation is only performed on the elements whose mask bit is set. If the bit is unset, the result value is not defined by the operation but by whether a mask-undisturbed or a mask-agnostic policy is selected (global parameter).
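The interplay between vstart, vl, the v0 mask and the two policies can be sketched with a small scalar reference model. This is only an illustration in Python, not a description of any implementation: the function name masked_vadd and its parameters are hypothetical, vectors are plain lists with one entry per element, and the agnostic policies are modelled by always writing all ones (the specification also allows the old value to be kept).

```python
def masked_vadd(vd, vs2, vs1, v0, vstart, vl, sew,
                mask_agnostic=False, tail_agnostic=False):
    """Illustrative model of a masked vadd.vv; returns the new content of vd."""
    modulo = 1 << sew          # results wrap around on SEW bits
    all_ones = modulo - 1      # value written by this model for agnostic elements
    result = list(vd)
    for i in range(len(vd)):
        if i < vstart:
            continue                       # prestart elements are never modified
        if i >= vl:
            if tail_agnostic:              # tail element
                result[i] = all_ones
            continue
        if v0[i]:                          # body element, mask bit set
            result[i] = (vs2[i] + vs1[i]) % modulo
        elif mask_agnostic:                # body element, masked off
            result[i] = all_ones
    return result


old_vd = [9] * 8
a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
mask = [1, 0, 1, 0, 1, 0, 1, 0]
print(masked_vadd(old_vd, a, b, mask, vstart=0, vl=6, sew=8))
print(masked_vadd(old_vd, a, b, mask, vstart=0, vl=6, sew=8,
                  mask_agnostic=True, tail_agnostic=True))
```

With the undisturbed policies, masked-off and tail elements keep the old destination value (9 here); with the agnostic policies, this model overwrites them with 255.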
Vector Integer Arithmetic
RVV 1.0 specifies standard integer operations:
- add, sub
- min/max
- zero and sign extensions
- widening add/sub
- add with carry / sub with borrow
Operand variants
Most of the integer arithmetic operations admit three variants: .vv, .vx, and .vi.
.vv describes an element-wise operation between vector registers, while .vx (resp. .vi) describes an element-wise operation between vector register(s) and a splat scalar value (resp. scalar immediate). The scalar value is read from the scalar register file x, while the immediate value is encoded in the instruction itself (generally in the rs1 bitfield).
Those variants decrease vector register pressure for operations with at least one uniform operand, although .vx still requires a read from the scalar register file.
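The three operand variants can be sketched as follows, using vadd as the example. This is a minimal Python model with hypothetical function names: vectors are lists of unsigned SEW-bit values, the scalar is splatted to every element, and the 5-bit immediate field of vadd.vi is sign-extended before being splatted.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def vadd_vv(vs2, vs1, sew):
    """Vector-vector: element i of vs2 is added to element i of vs1."""
    return [(a + b) % (1 << sew) for a, b in zip(vs2, vs1)]


def vadd_vx(vs2, rs1_value, sew):
    """Vector-scalar: the scalar register value is splatted to every element."""
    return vadd_vv(vs2, [rs1_value] * len(vs2), sew)


def vadd_vi(vs2, imm_field, sew):
    """Vector-immediate: the 5-bit instruction field is sign-extended, then splatted."""
    return vadd_vx(vs2, sign_extend(imm_field, 5), sew)


print(vadd_vv([1, 2], [3, 4], sew=8))
print(vadd_vx([250, 1], 10, sew=8))
print(vadd_vi([1, 2, 3], 0b11111, sew=8))   # immediate field encodes -1
```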
Widening add/sub
Some operations, e.g. vwadd.vv, have a wide result compared to the size of the operand elements: each operand is interpreted as a vector of SEW-wide elements, while the result elements are (2 * SEW)-bit wide. In this case the effective element width (EEW) of the result differs from the SEW setting.
There is often an extra variant for widening operations: .wv. This variant interprets one of the operands as wide (in addition to the result). Such an operation may require reading up to 5 * SEW bits per element: 3 * SEW bits for the operands, plus an extra 2 * SEW bits for the old destination data (under a tail- or mask-undisturbed policy), and it requires writing 2 * SEW bits per result element.
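A small scalar model may help to see why the widened result is not simply an extension of a SEW-bit sum. The sketch below is in Python with hypothetical function names; elements are stored as unsigned bit patterns and interpreted as signed values, as vwadd does.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def vwadd_vv(vs2, vs1, sew):
    """Both operands are SEW wide; each result element is 2*SEW wide."""
    wide_mask = (1 << (2 * sew)) - 1
    return [(sign_extend(a, sew) + sign_extend(b, sew)) & wide_mask
            for a, b in zip(vs2, vs1)]


def vwadd_wv(vs2_wide, vs1, sew):
    """vs2 is already 2*SEW wide; only vs1 is sign-extended from SEW bits."""
    wide_mask = (1 << (2 * sew)) - 1
    return [(a + sign_extend(b, sew)) & wide_mask
            for a, b in zip(vs2_wide, vs1)]


# 127 + 127 = 254 does not fit in a signed 8-bit element but fits in 16 bits.
print([hex(x) for x in vwadd_vv([0x7F, 0x80], [0x7F, 0x80], sew=8)])
print([hex(x) for x in vwadd_wv([0x00FE], [0xFF], sew=8)])
```

In this model the operands are extended first and the addition is carried out on the wide format, so no overflow can occur.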
Carry and borrow
RVV uses the mask register (v0) to store carry/borrow inputs. The bit from this mask is combined with an addition (resp. subtraction) in the vadc (resp. vsbc) instruction family. These instructions produce the arithmetic result. A variant, whose mnemonic is vmadc (resp. vmsbc), can be used to produce the carry-out (resp. borrow-out) mask. Since a vector instruction produces a single result, it is not possible to get the add-with-carry result and the carry-out simultaneously.
Note: A superscalar implementation could perform instruction fusion between these two operations and execute them as a single micro-op producing both result and output carry mask.
Any register can be used to store the result mask, but as for most RVV instructions only v0 can be used as input mask. (We will come back to this when we cover mask operations in a future episode.)
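As an illustration of how the pair of instructions cooperates, here is a sketch of a multi-word addition in Python. The function names and the data layout are assumptions made for the example: each list stands for a vector register holding one SEW-bit limb of every element, least significant limb first, and the carry list plays the role of the v0 mask.

```python
def vadc(vs2, vs1, carry_in, sew):
    """Sum with carry-in, truncated to SEW bits (the vadc result)."""
    return [(a + b + c) % (1 << sew) for a, b, c in zip(vs2, vs1, carry_in)]


def vmadc(vs2, vs1, carry_in, sew):
    """Carry-out of the same sum, one bit per element (the vmadc mask result)."""
    return [int(a + b + c >= (1 << sew)) for a, b, c in zip(vs2, vs1, carry_in)]


def add_multiword(a_limbs, b_limbs, sew):
    """Add wide numbers stored as one register (list) per limb."""
    carry = [0] * len(a_limbs[0])
    sum_limbs = []
    for limb_a, limb_b in zip(a_limbs, b_limbs):
        next_carry = vmadc(limb_a, limb_b, carry, sew)   # computed before the sum,
        sum_limbs.append(vadc(limb_a, limb_b, carry, sew))  # from the same inputs
        carry = next_carry                               # becomes v0 for the next limb
    return sum_limbs, carry


# Two 16-bit additions done with 8-bit limbs: 0x01FF + 0x0001 and 0xFFFF + 0x0001.
a = [[0xFF, 0xFF], [0x01, 0xFF]]
b = [[0x01, 0x01], [0x00, 0x00]]
print(add_multiword(a, b, sew=8))
```

Each limb needs two passes over the same inputs, one for the carry mask and one for the sum, which is the pattern the fusion note above refers to.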
Vector Fixed-Point Arithmetic
RVV specifies:
- saturating and averaging operations
- fractional multiplies with rounding and saturation
- scaling shifts
- narrowing fixed-point clips
The rounding is performed according to the rounding mode set in the vxrm CSR (also accessible as a field of vcsr).
Saturating operations saturate to the minimal or maximal representable value on overflow (and set the vxsat status bit).
For averaging operations, the sum or difference is right-shifted by one bit (equivalent to a division by 2), with rounding according to vxrm.
The fixed-point format for fractional multiplies depends on SEW: it assumes operands with (SEW - 1) fractional bits and produces a result with (SEW - 1) fractional bits too, rounding away the extra fractional bits.
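The rounding, saturating and averaging behaviours can be sketched with scalar Python functions working on signed integers. The names below are chosen after the mnemonics for readability but are only a model: the rounding mode is passed as a string (rnu, rne, rdn, rod) instead of being read from vxrm, and the saturation flag is returned instead of being accumulated in vxsat.

```python
def roundoff(v, d, vxrm):
    """Shift v right by d bits, rounding according to the given mode name."""
    if d == 0:
        return v
    shifted = v >> d                         # arithmetic shift on Python ints
    half_bit = (v >> (d - 1)) & 1            # most significant discarded bit
    discarded = v & ((1 << d) - 1)           # all discarded bits
    below_half = v & ((1 << (d - 1)) - 1)    # discarded bits below the half bit
    lsb = shifted & 1                        # least significant kept bit
    if vxrm == "rnu":      # round to nearest, ties up
        r = half_bit
    elif vxrm == "rne":    # round to nearest, ties to even
        r = half_bit & int(below_half != 0 or lsb == 1)
    elif vxrm == "rdn":    # round down (truncate)
        r = 0
    elif vxrm == "rod":    # round to odd
        r = int(lsb == 0 and discarded != 0)
    else:
        raise ValueError(vxrm)
    return shifted + r


def saturate(value, sew):
    """Clip to the signed SEW-bit range; also report whether clipping occurred."""
    lo, hi = -(1 << (sew - 1)), (1 << (sew - 1)) - 1
    clipped = max(lo, min(hi, value))
    return clipped, clipped != value


def vsadd(a, b, sew):
    return saturate(a + b, sew)              # saturating add


def vaadd(a, b, vxrm):
    return roundoff(a + b, 1, vxrm)          # averaging add


def vsmul(a, b, sew, vxrm):
    # Operands carry SEW-1 fractional bits, so the product carries 2*(SEW-1).
    return saturate(roundoff(a * b, sew - 1, vxrm), sew)


print(vsadd(100, 100, 8))             # overflow clips to the maximum
print(vaadd(1, 2, "rnu"), vaadd(1, 2, "rdn"))
print(vsmul(64, 64, 8, "rnu"))        # 0.5 * 0.5 with 7 fractional bits
print(vsmul(-128, -128, 8, "rnu"))    # -1 * -1 is the only saturating case
```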
Vector Floating-Point Arithmetic
Specification: https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#13-vector-floating-point-instructions
RVV defines a standard set of floating-point addition, subtraction, and multiplication instructions, as well as several variants of fused multiply-add, which perform a single final rounding.
Operations exist in both single-width and widening forms. Fused multiply-add instructions only exist in destructive form: one of the operands is also the destination register overwritten by the operation.
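The effect of the single final rounding can be shown on one element with a short Python sketch. It is only a model under stated assumptions: it works in double precision (Python floats), the function names are hypothetical, and the fused result is obtained by computing the exact value with rational arithmetic and rounding once when converting back to a float.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Exact a*b + c followed by a single rounding to double precision."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0
# The exact product is 1 - 2**-60, which rounds to 1.0 when rounded on its own.
print(unfused_multiply_add(a, b, c))   # the small term is lost
print(fused_multiply_add(a, b, c))     # the small term survives
```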
One of the most demanding operations in terms of operands is vfwadd.wv vd, vs2, vs1: in an undisturbed configuration (tail or mask), when LMUL=1, this operation may require 5 register inputs (one for vs1 and two for each of vs2 and vd) and may produce 2 register outputs (similarly to the case described for integer widening instructions).
The floating-point set also contains vector division, square root, and estimate operations for the reciprocal (vfrec7.v) and the reciprocal square root (vfrsqrt7.v). These estimates are accurate to 7 bits, which makes them useful as starting points for Newton-Raphson based division or square root approximations, or for argument reduction in elementary functions such as the logarithm.
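To illustrate how a 7-bit estimate seeds a Newton-Raphson division, here is a scalar Python sketch. The estimate function is a stand-in and an assumption of this example: it truncates the true reciprocal to 7 fractional significand bits and does not reproduce the lookup table specified for vfrec7.v. The refinement step x * (2 - a * x) is the classical Newton-Raphson iteration for the reciprocal, which roughly doubles the number of accurate bits each time.

```python
import math


def estimate_reciprocal_7bit(a):
    """Stand-in for a 7-bit reciprocal estimate (NOT the vfrec7.v table)."""
    m, e = math.frexp(1.0 / a)          # 1/a == m * 2**e with 0.5 <= |m| < 1
    m = math.floor(m * 256) / 256       # keep 7 bits after the leading one
    return math.ldexp(m, e)


def refine_reciprocal(a, steps):
    """Apply `steps` Newton-Raphson iterations to the 7-bit estimate."""
    x = estimate_reciprocal_7bit(a)
    for _ in range(steps):
        x = x * (2.0 - a * x)           # relative error is squared at each step
    return x


for steps in range(4):
    x = refine_reciprocal(3.0, steps)
    print(steps, x, abs(3.0 * x - 1.0))
```

On a vector unit, each iteration would map to a couple of multiply and fused multiply-add instructions applied to all elements at once.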
Finally the floating-point set includes floating-point min/max, sign injection, numerous format conversions and compare operations.
Compare operations provide their result as a mask: the boolean result of each element-wise comparison is encoded by the corresponding mask bit. We will come back to the compare operations when we review the mask operations in a future article.
Vector Reductions
Specification: https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#vector-reduction-operations
RVV 1.0 specifies a few reduction operations. Those operations expect a scalar operand (element 0 from the vector register operand vs1) and a vector group (vs2) and reduce all the active elements from vs2 with the scalar operand into a single scalar value stored into element 0 of the destination register vd.
RVV 1.0 specifies (f)min(u)/(f)max(u), and/or/xor, single-width and widening sum reductions (integer or floating-point).
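The operand roles of an integer reduction can be summarized by the following Python sketch. The function name reduce_vs and its signature are hypothetical; elements are treated as unsigned SEW-bit values, the accumulator starts from element 0 of vs1, and only active elements of vs2 (index below vl, mask bit set when a mask is given) contribute.

```python
import operator


def reduce_vs(op, vs2, vs1, vl, sew, v0=None):
    """Model of a single-width integer reduction; returns the new vd[0]."""
    acc = vs1[0]                               # scalar operand
    for i in range(vl):
        if v0 is None or v0[i]:                # skip masked-off elements
            acc = op(acc, vs2[i]) % (1 << sew)
    return acc


data = [1, 2, 3, 4, 5]
print(reduce_vs(operator.add, data, [100], vl=4, sew=8))                   # sum
print(reduce_vs(operator.add, data, [100], vl=4, sew=8, v0=[1, 0, 1, 0]))  # masked sum
print(reduce_vs(max, [3, 200, 7], [50], vl=3, sew=8))                      # unsigned max
print(reduce_vs(operator.xor, [0b1100, 0b1010], [0b0001], vl=2, sew=8))    # xor
```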
RVV 1.0 offers two types of floating-point sum reductions: ordered and unordered.
The ordered variant forces a sequential order of evaluation for the intermediate additions. Its result is deterministic across implementations, and it can be used to vectorize reduction loops in most languages when sequential semantics are expected (e.g. C/C++ when -ffast-math, -fassociative-math, or another aggressive optimization option is not applied).
The unordered variant offers more room for micro-architectural implementations. As long as the implementation is deterministic, it can implement the unordered reduction as a multi-level binary tree (even allowing different precisions in the tree).
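Because floating-point addition is not associative, the two evaluation orders can give different results. The Python sketch below compares a sequential sum with one possible tree-shaped sum; the function names are hypothetical, the computation uses double precision, and the pairwise tree is just one of the shapes an implementation might choose for an unordered reduction.

```python
def ordered_sum(scalar, elements):
    """Sequential evaluation, starting from the scalar operand."""
    acc = scalar
    for x in elements:
        acc += x
    return acc


def tree_sum(scalar, elements):
    """Pairwise binary tree over the elements, scalar added at the end."""
    level = list(elements)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)                  # pad odd levels with the identity
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return scalar + level[0]


values = [1e17, 1.0, -1e17, 1.0]
print(ordered_sum(0.0, values))   # the last 1.0 is added after the cancellation
print(tree_sum(0.0, values))      # each 1.0 is absorbed by a large term first
```

Neither result is exact here (the mathematical sum is 2.0), which is why the choice between the two variants is about reproducibility and matching sequential source semantics rather than accuracy.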
Conclusion
In the next part we will review operations with or on masks.