RISC-V Vector Extension in a Nutshell (Part 3): mask & masked operations
This article is part 3 of a series on the RISC-V vector extension, and focuses on masks.
Part 1 introduces the series with a general overview, Part 2 reviews arithmetic operations, and Part 4 describes permute instructions. Part 5.1 and Part 5.2 review memory operations.
Masked operations
As stated in Part 2, most operations in RVV 1.0 can be masked: an extra mask operand can be provided through v0 to select which elements are actually modified by the operation.
Most operations use bit 25 of the instruction encoding (the vm field) to select whether the operation is masked. If the operation is unmasked (inst[25]=1), all body elements are considered active. If the operation is masked (inst[25]=0), each element lane is associated with a bit in v0: if the bit is set, the lane operation is performed; otherwise the result element either keeps its previous value in vd (undisturbed mask policy, also permitted under the agnostic policy) or is filled with all 1s (agnostic mask policy only). As for the tail, the agnostic policy lets the implementer choose between all 1s and undisturbed elements, on a per-lane basis. As @AngelineCaroCol pointed out on Twitter, this choice can be exploited to reduce the number of values a processor using register renaming must read: the read of the old destination data can be saved if the implementation chooses to fill in all 1s in the agnostic case.
In RVV 1.0, only v0 can be used as the mask operand of a masked operation. The RISC-V task group mentioned a possible future extension of RVV to 64-bit opcodes with more room to specify an arbitrary mask register. v0 can also be used as a standard vector register. A masked operation can even write v0 while reading it as its mask, but RVV 1.0 reserves this overlap unless the destination is written with a mask value (e.g. a comparison) or with the scalar result of a reduction.
Using v0 to hold an input mask reduces the number of available registers and may greatly increase register allocation pressure. The effect worsens as LMUL grows. The worst case is LMUL=8, where using v0 as the mask source reduces the number of available vector register groups from 4 to 3. The loss is limited to 1 register out of 32 when LMUL <= 1.
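The active/inactive behaviour described above can be sketched with a small Python model. This is only an illustration, not an implementation of any real core: the function name masked_add, the 8-bit element width and the fill_inactive_with_ones flag are assumptions made for the example, and the flag models just one of the choices the agnostic policy allows (a real implementation may choose per lane).

```python
SEW = 8                      # assumed element width for this sketch
ALL_ONES = (1 << SEW) - 1    # the all-1s pattern for one element

def masked_add(vd, vs2, vs1, v0, vl, fill_inactive_with_ones=False):
    """Model of a masked element-wise add over the body elements [0, vl).

    vd is the previous content of the destination, v0 a list of mask bits.
    Tail elements (index >= vl) are left untouched in this sketch.
    """
    result = list(vd)
    for i in range(vl):
        if v0[i]:
            # Active lane: the operation is performed (wrapping to SEW bits).
            result[i] = (vs2[i] + vs1[i]) & ALL_ONES
        elif fill_inactive_with_ones:
            # Inactive lane, agnostic policy, implementation writes all 1s.
            result[i] = ALL_ONES
        # Otherwise the inactive lane keeps its previous value (undisturbed).
    return result

old_vd = [9, 9, 9, 9]
undisturbed = masked_add(old_vd, [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], vl=4)
agnostic = masked_add(old_vd, [1, 2, 3, 4], [10, 20, 30, 40], [1, 0, 1, 0], vl=4,
                      fill_inactive_with_ones=True)
print(undisturbed)  # [11, 9, 33, 9]
print(agnostic)     # [11, 255, 33, 255]
```

In the all-1s case the loop never reads the old value of an inactive lane, which is the saving on the old-destination read mentioned above.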
Note: Some operations, such as the reductions (e.g. vredsum.vs vd, vs2, vs1, vm), rely on the mask to select the active input elements rather than the active result elements: the 0-th element of the destination vd is still modified even if the operation is masked and v0[0] = 0. In that case, however, the 0-th element of the input vector vs2 is not accumulated into the reduction result. This is illustrated by the figure below.
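As a rough sketch of this reduction behaviour, the Python model below computes the value a masked vredsum.vs would write to vd[0]. The function name vredsum, the list-based registers and the 8-bit default element width are assumptions for illustration, and the sketch assumes vl > 0.

```python
def vredsum(vs2, vs1, v0, vl, sew=8):
    """Value written to vd[0] by a masked sum reduction (sketch, vl > 0).

    The mask selects which elements of vs2 are accumulated; it does not
    decide whether vd[0] is written.
    """
    acc = vs1[0]                 # the scalar seed comes from vs1[0]
    for i in range(vl):
        if v0[i]:                # only active input elements are accumulated
            acc += vs2[i]
    return acc & ((1 << sew) - 1)

# v0[0] = 0: vs2[0] is skipped, but a result is still produced for vd[0].
print(vredsum([1, 2, 3, 4], [100, 0, 0, 0], [0, 1, 1, 0], vl=4))  # 105
# No active element at all: vd[0] simply receives the seed vs1[0].
print(vredsum([1, 2, 3, 4], [100, 0, 0, 0], [0, 0, 0, 0], vl=4))  # 100
```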
VLEN and mask size
Because the group multiplier, LMUL, cannot exceed 8 and the element width, SEW, cannot be smaller than 8 bits, a mask never needs more than VLEN bits, even for the largest possible vector register group. A single vector register is therefore wide enough to hold a mask even for an operation with LMUL=8 and the smallest SEW; there is no need to combine multiple registers to build a mask register.
For the largest element widths, only a portion of the VLEN-wide vector register is needed. For example, with SEW=64 bits, an LMUL=8 vector register group contains VLEN/SEW*LMUL = VLEN/8 elements: only 1/8 of the vector register holds valid mask bits.
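The arithmetic above can be checked with a few lines of Python. The helper names vlmax and mask_fraction are made up for this sketch; VLEN=128 is just an example value.

```python
from fractions import Fraction

def vlmax(vlen, sew, lmul):
    """Number of elements in a vector register group: VLEN / SEW * LMUL."""
    return int(Fraction(vlen) * Fraction(lmul) / sew)

def mask_fraction(vlen, sew, lmul):
    """Fraction of a single VLEN-bit register occupied by the mask bits."""
    return Fraction(vlmax(vlen, sew, lmul), vlen)

# Smallest SEW, largest LMUL: the mask uses exactly one full register.
print(vlmax(128, sew=8, lmul=8), mask_fraction(128, sew=8, lmul=8))    # 128 1
# SEW=64, LMUL=8: only 1/8 of the register holds mask bits.
print(vlmax(128, sew=64, lmul=8), mask_fraction(128, sew=64, lmul=8))  # 16 1/8
```

Passing lmul as a Fraction (for instance Fraction(1, 2)) covers fractional LMUL as well.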
Operations producing masks
There exist a few instructions to manipulate masks, which we will review in the next section, but there are also specific instructions that produce masks from non-mask operands.
These instructions include the integer and floating-point comparisons: those instructions perform element-wise comparisons and produce a bit per comparison, concatenating them into a result mask.
Those instructions can be masked: only the comparisons on active elements are performed, and the result for inactive elements depends on the mask policy.
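A minimal Python sketch of a mask-producing comparison, loosely modelled on a signed less-than such as vmslt.vv, is given below. The function name, the list-of-bits mask representation and the choice of the undisturbed policy for inactive elements are assumptions made for illustration.

```python
def vmslt(vs2, vs1, vl, v0=None, vd_old=None):
    """Element-wise vs2[i] < vs1[i], one result bit per comparison.

    v0 is an optional list of mask bits; vd_old is the previous content of
    the destination mask. Inactive result bits keep their old value here
    (undisturbed policy); an agnostic implementation could write 1s instead.
    """
    result = list(vd_old) if vd_old is not None else [0] * vl
    for i in range(vl):
        if v0 is None or v0[i]:          # unmasked, or active element
            result[i] = 1 if vs2[i] < vs1[i] else 0
    return result

print(vmslt([1, 5, 3, 7], [4, 4, 4, 4], vl=4))                  # [1, 0, 1, 0]
print(vmslt([1, 5, 3, 7], [4, 4, 4, 4], vl=4,
            v0=[1, 1, 0, 0], vd_old=[1, 1, 1, 1]))              # [1, 0, 1, 1]
```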
Operations with masks
https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#sec-vector-mask
RVV 1.0 also specifies operations on masks, such as vmand.mm: those operations accept mask operands and produce a mask result. Mask operations place no restriction on the mask registers: any vector register can be used as a mask input or output. However, the final mask operation needs to target v0, or the mask result has to be moved into v0, before it can be used as the mask operand of a masked operation.
Note: the tail policy for operations on masks is a bit more permissive: it allows an implementation to handle every such operation assuming the agnostic policy, or assuming vl is extended, which simplifies tail management (the micro-architecture is not forced to manage the tail at bit granularity). More details are available in this section of the RVV 1.0 specification.
RVV 1.0 specifies standard bitwise logic operations on mask: https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#151-vector-mask-register-logical-instructions.
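As an illustration, a few of these logical operations can be modelled in Python with a mask stored as an integer whose bit i is the mask bit of element i. The function names mirror the mnemonics, but the integer representation and the explicit vl argument are assumptions of this sketch; tail bits are simply cleared here, which is one of the outcomes the permissive tail policy allows.

```python
def _body(vl):
    """Bit pattern covering the body elements [0, vl)."""
    return (1 << vl) - 1

def vmand(vs2, vs1, vl):
    return vs2 & vs1 & _body(vl)

def vmor(vs2, vs1, vl):
    return (vs2 | vs1) & _body(vl)

def vmxor(vs2, vs1, vl):
    return (vs2 ^ vs1) & _body(vl)

def vmandn(vs2, vs1, vl):
    # vs2 AND NOT vs1
    return vs2 & ~vs1 & _body(vl)

a, b = 0b1100, 0b1010
print(format(vmand(a, b, 4), "04b"))   # 1000
print(format(vmor(a, b, 4), "04b"))    # 1110
print(format(vmxor(a, b, 4), "04b"))   # 0110
print(format(vmandn(a, b, 4), "04b"))  # 0100
```

A typical use is to combine the masks of two comparisons and write the result directly into v0, so that it can predicate the next masked operation.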
RVV 1.0 also specifies:
- population count (popcount) on a mask: vcpop.m, useful to count the active bits set in a mask. This instruction writes its result to a scalar register.
- find-first-set on a mask: vfirst.m, which returns the index of the first active set bit (that is, the number of zeros preceding it), or -1 if no active bit is set. This instruction also writes to a scalar register. The operation can be masked, in which case inactive elements of the vs2 source are treated as zeros.
- instructions to build a mask around the first active 1: vmsbf.m, vmsif.m, vmsof.m
- viota.m and vid.v, to build vectors of accumulated active-bit counts and of element indices.
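The instructions listed above can be summarised by the following Python sketch of their unmasked forms. Masks are modelled as lists of bits (index = element number); the function names follow the mnemonics, while the list representation and the explicit vl argument are assumptions for illustration only.

```python
def vcpop(vs2, vl):
    """Number of set bits in the mask."""
    return sum(vs2[:vl])

def vfirst(vs2, vl):
    """Index of the first set bit, or -1 if none is set."""
    for i in range(vl):
        if vs2[i]:
            return i
    return -1

def vmsbf(vs2, vl):
    """Set-before-first: 1s strictly before the first set bit."""
    first = vfirst(vs2, vl)
    limit = vl if first < 0 else first
    return [1 if i < limit else 0 for i in range(vl)]

def vmsif(vs2, vl):
    """Set-including-first: 1s up to and including the first set bit."""
    first = vfirst(vs2, vl)
    limit = vl if first < 0 else first + 1
    return [1 if i < limit else 0 for i in range(vl)]

def vmsof(vs2, vl):
    """Set-only-first: a single 1 at the first set bit."""
    first = vfirst(vs2, vl)
    return [1 if i == first else 0 for i in range(vl)]

def viota(vs2, vl):
    """Element i receives the number of set bits at indices below i."""
    out, count = [], 0
    for i in range(vl):
        out.append(count)
        count += vs2[i]
    return out

def vid(vl):
    """Element i receives its own index."""
    return list(range(vl))

mask = [0, 0, 1, 0, 1, 0, 0, 1]
print(vcpop(mask, 8), vfirst(mask, 8))  # 3 2
print(vmsbf(mask, 8))                   # [1, 1, 0, 0, 0, 0, 0, 0]
print(vmsif(mask, 8))                   # [1, 1, 1, 0, 0, 0, 0, 0]
print(vmsof(mask, 8))                   # [0, 0, 1, 0, 0, 0, 0, 0]
print(viota(mask, 8))                   # [0, 0, 0, 1, 1, 2, 2, 2]
print(vid(8))                           # [0, 1, 2, 3, 4, 5, 6, 7]
```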
Conclusion
This article has provided a coarse survey of RVV 1.0 masked instructions (variants of standard instructions) and of the instructions using or producing masks.
In the next post, we will review the permute instructions.
Many thanks to @AngelineCaroCol, who inspired part of this blog post during a very interesting exchange about a previous post.
Updated Oct 24th 2022 with more links to other parts of this blog series.
Updated Feb 2nd 2023 for publication on substack
Updated Apr 8th 2024 to correct a typo (thank you Shaka for pointing it out)
References:
RVV 1.0 operations with masks https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#sec-vector-mask