Emulating future RISC-V Vector extensions
Playing with your toy before Christmas
Lately, I have been considering the following problem: how can one develop an application example for a future RISC-V (Vector) extension when no fast hardware support exists yet?
Usually, I would rely on an instruction set simulator (like Spike), or on a machine emulator such as QEMU. But with the growing availability of RISC-V Vector hardware, I wanted to explore another solution: emulating new RISC-V Vector instructions using RVV 1.0. Enter rvv-intrinsic-emulation.
rvv-intrinsic-emulation (RIE [1]) is a very simple code generator that emits emulation code for RISC-V Vector extensions using RVV 1.0 intrinsics. It currently supports Zvkb (vector bit manipulation), Zvdot4a8i (a fast-track project for 4-element vector integer dot product) and Zvzip (a fast-track project for interleaving/de-interleaving vector elements), and should soon support Zvabd (a fast-track project for vector absolute difference).
Because the generated code uses RVV 1.0 intrinsics, it can easily be executed on existing hardware with RVV 1.0 support (such as the BananaPi-F3 or the CanMV K230). This makes it quite useful for developing and running large applications that target future vector extensions before hardware support becomes available. The other benefit is that no compiler support is required for the new instructions: the application simply calls the intrinsic API, which is emulated by RIE-generated code.
What rvv-intrinsic-emulation is not
RIE is not a compiler: its code generation is very (very, very) basic, and most of the expertise must be injected when the emulation description is developed (e.g. there is no constant propagation). The initial goal is to provide accelerated functionality (using hardware support), but at the moment the emulation is definitely not optimal (most of the speed-up comes from mapping to hardware vector instructions).
User guide for rvv-intrinsic-emulation
RIE produces self-contained emulation functions which follow the RISC-V Vector intrinsics API pattern. For example, for the vdot4au.vv instruction from the extension Zvdot4a8i with LMUL=1 and tail-undisturbed policy, it generates the following prototype:
vuint32m1_t __riscv_vdot4au_vv_u32m1_tu(vuint32m1_t vd,
                                        vuint32m1_t vs2,
                                        vuint32m1_t vs1,
                                        size_t vl);
The prototype and the implementation can be generated as follows:
$ python3 scripts/generate_emulation.py -e zvdot4a8i --lmul m1 --tail-policy tu --mask-policy um
The command line interface lets the user filter the desired LMUL, element width, tail policy and mask policy values. The tool generates a file which can be embedded as a header into the target application, providing a lightweight translation layer between new/unsupported/unratified instructions and existing hardware.
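As a side note on the API pattern mentioned above, here is a minimal sketch of how an explicit (non-overloaded) intrinsic name is assembled from its components. The function intrinsic_name and its parameters are hypothetical (they are not part of RIE), and the sketch only covers the simple case of one mnemonic, one operand kind and one result type; the policy suffixes follow the RVV intrinsics naming scheme.

```python
def intrinsic_name(mnemonic, operands, elt_type, lmul,
                   tail_undisturbed=False, masked=False, mask_undisturbed=False):
    """Build an explicit RVV intrinsic name (hypothetical helper, simple case only)."""
    if mask_undisturbed and not masked:
        raise ValueError("a mask-undisturbed policy requires a masked operation")
    if not masked:
        # unmasked: tail-agnostic has no suffix, tail-undisturbed uses _tu
        suffix = "_tu" if tail_undisturbed else ""
    elif tail_undisturbed:
        suffix = "_tumu" if mask_undisturbed else "_tum"
    else:
        suffix = "_mu" if mask_undisturbed else "_m"
    return f"__riscv_{mnemonic}_{operands}_{elt_type}{lmul}{suffix}"


# The prototype shown earlier: vdot4au.vv, 32-bit unsigned result, LMUL=1, tail-undisturbed
print(intrinsic_name("vdot4au", "vv", "u32", "m1", tail_undisturbed=True))
```

Intrinsics with more than one type in their name (reinterpret casts, for instance) do not fit this simplified helper.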
For example, the previous command generates:
(...)
// Zvdot4a8i definitions (LMUL=m1), tail_policy=undisturbed, mask_policy=unmasked
vuint32m1_t __riscv_vdot4au_vv_u32m1_tu(vuint32m1_t vd, vuint32m1_t vs2, vuint32m1_t vs1, size_t vl) {
    vuint8m1_t tmp0 = __riscv_vreinterpret_v_u32m1_u8m1(vs2);
    vuint8m1_t tmp1 = __riscv_vreinterpret_v_u32m1_u8m1(vs1);
    size_t tmp2 = vl * 4;
    vuint16m2_t tmp3 = __riscv_vwmulu_vv_u16m2(tmp0, tmp1, tmp2);
    vuint64m2_t tmp4 = __riscv_vreinterpret_v_u16m2_u64m2(tmp3);
    vuint32m1_t tmp5 = __riscv_vnsrl_wx_u32m1(tmp4, 32, vl);
    vuint16m1_t tmp6 = __riscv_vreinterpret_v_u32m1_u16m1(tmp5);
    vuint32m1_t tmp7 = __riscv_vnsrl_wx_u32m1(tmp4, 0, vl);
    vuint16m1_t tmp8 = __riscv_vreinterpret_v_u32m1_u16m1(tmp7);
    size_t tmp9 = vl * 2;
    vuint32m2_t tmp10 = __riscv_vwaddu_vv_u32m2(tmp6, tmp8, tmp9);
    vuint64m2_t tmp11 = __riscv_vreinterpret_v_u32m2_u64m2(tmp10);
    vuint32m1_t tmp12 = __riscv_vnsrl_wx_u32m1(tmp11, 32, vl);
    vuint32m1_t tmp13 = __riscv_vnsrl_wx_u32m1(tmp11, 0, vl);
    vuint32m1_t tmp14 = __riscv_vadd_vv_u32m1(tmp12, tmp13, vl);
    vuint32m1_t tmp15 = __riscv_vadd_vv_u32m1_tu(vd, tmp14, vd, vl);
    return tmp15;
}
(...)
The code is rather verbose (using the non-overloaded explicit RVV intrinsics API) and it is not expected to be modified manually.
This code can be built by any compiler with support for standard RVV 1.0 intrinsics.
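To make the generated sequence easier to follow, here is a scalar Python sketch of what it computes, checked against the direct definition of the unsigned 4-element dot product. The function names dot4au_reference and dot4au_emulated are hypothetical, vectors are modelled as plain lists of 32-bit integers, and a little-endian element layout is assumed when reinterpreting between element widths.

```python
MASK32 = 0xFFFFFFFF


def byte(word, b):
    """Extract byte b (0 = least significant) of a 32-bit word."""
    return (word >> (8 * b)) & 0xFF


def dot4au_reference(vd, vs2, vs1, vl):
    """Direct definition: vd[i] += sum of the four byte products, modulo 2**32."""
    out = list(vd)  # tail-undisturbed: elements at index >= vl keep vd's value
    for i in range(vl):
        acc = vd[i]
        for b in range(4):
            acc += byte(vs2[i], b) * byte(vs1[i], b)
        out[i] = acc & MASK32
    return out


def dot4au_emulated(vd, vs2, vs1, vl):
    """Mirror the generated sequence step by step, one 32-bit element at a time."""
    out = list(vd)
    for i in range(vl):
        # vwmulu on the 8-bit view: four 16-bit products p0..p3 per 32-bit element
        p = [byte(vs2[i], b) * byte(vs1[i], b) for b in range(4)]
        # vnsrl by 32 and by 0 on the 64-bit view: split into (p2, p3) and (p0, p1)
        hi, lo = p[2:], p[:2]
        # vwaddu on the 16-bit views: two 32-bit partial sums (p2 + p0, p3 + p1)
        partial = [hi[0] + lo[0], hi[1] + lo[1]]
        # second vnsrl pair followed by vadd: sum of the two partial sums
        total = (partial[0] + partial[1]) & MASK32
        # final tail-undisturbed vadd: accumulate into vd
        out[i] = (vd[i] + total) & MASK32
    return out


vd = [1, 0xFFFFFFFF, 7]
vs2 = [0x01020304, 0xFFFFFFFF, 0x11111111]
vs1 = [0x05060708, 0xFFFFFFFF, 0x22222222]
assert dot4au_emulated(vd, vs2, vs1, 2) == dot4au_reference(vd, vs2, vs1, 2)
```

With vl=2, the third element is part of the tail and keeps its original value in both models.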
Developer guide for rvv-intrinsic-emulation
RIE is built around a Python description. The emulation algorithm is described with Operation nodes, each with an explicit OperationType and a NodeFormatDescriptor (format). LMUL, element type, mask policy and tail policy are provided as parameters, and the emulation code is dynamically generated based on the values of those arguments.
The representation is very verbose and still being defined, but it is sufficient for current needs.
def and_not(op0: Node, op1: Node, vl: Node, vm: Node = None, dst: Node = None,
            tail_policy: TailPolicy = TailPolicy.UNDEFINED,
            mask_policy: MaskPolicy = MaskPolicy.UNDEFINED) -> Node:
    """Generate vector andn (and-not) using RVV 1.0 operations only."""
    not_desc = OperationDescriptor(OperationType.NOT)
    not_op1 = Operation(op1.node_format, not_desc, op1, vl)
    and_desc = OperationDescriptor(OperationType.AND)
    return Operation(op0.node_format, and_desc, op0, not_op1, vl, vm=vm, dst=dst,
                     tail_policy=tail_policy, mask_policy=mask_policy)
source: src/rie_generator/zvkb_emulation.py#L96-L101
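For readers less familiar with mask and tail handling, the following scalar sketch shows the semantics that this two-node description is meant to produce. The function vandn_model is hypothetical (it is not part of RIE); it models vectors as Python lists and treats masked-off and tail elements as undisturbed, which is also one of the behaviours allowed by the agnostic policies.

```python
def vandn_model(vs2, vs1, vl, vd, mask=None, sew=32):
    """Scalar model of vandn.vv: out[i] = vs2[i] & ~vs1[i] for active elements."""
    elt_mask = (1 << sew) - 1
    out = list(vd)  # masked-off and tail elements keep the destination value
    for i in range(vl):
        if mask is None or mask[i]:
            not_vs1 = ~vs1[i] & elt_mask  # the NOT node, truncated to the element width
            out[i] = vs2[i] & not_vs1     # the AND node, which carries mask and policies
    return out


print(vandn_model([0b1100, 0xFFFFFFFF, 5], [0b1010, 0x0F0F0F0F, 1], vl=2, vd=[9, 9, 9]))
```

As in and_not, only the final AND depends on the mask and the destination; the intermediate NOT can be computed unmasked because its inactive elements are never observed.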
A few basic helper functions are provided, for example emulate_with_split_lmul (source) which builds the emulation code for EMUL by splitting it into two EMUL/2 sub-calls (useful if the emulation code itself requires double the input LMUL value and you need to emulate an LMUL=8 operation).
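The splitting idea can be illustrated on plain Python lists. This is only a sketch of the principle under my own assumptions, not RIE's actual helper: split_lmul_sketch and add_tu are hypothetical names, a register group is modelled as a list whose length is VLMAX, and vl is distributed between the two halves.

```python
MASK32 = 0xFFFFFFFF


def add_tu(vd, vs2, vs1, vl):
    """Stand-in for an emulated operation: tail-undisturbed 32-bit vector add."""
    return [(a + b) & MASK32 if i < vl else d
            for i, (d, a, b) in enumerate(zip(vd, vs2, vs1))]


def split_lmul_sketch(op, vd, vs2, vs1, vl):
    """Apply op on the lower and upper halves of the register group separately."""
    half = len(vd) // 2        # VLMAX of each EMUL/2 half
    vl_lo = min(vl, half)      # active elements falling in the lower half
    vl_hi = max(vl - half, 0)  # active elements falling in the upper half
    lo = op(vd[:half], vs2[:half], vs1[:half], vl_lo)
    hi = op(vd[half:], vs2[half:], vs1[half:], vl_hi)
    return lo + hi             # concatenate the halves back into one group


vd, vs2, vs1 = [0] * 8, list(range(8)), [10] * 8
assert split_lmul_sketch(add_tu, vd, vs2, vs1, 6) == add_tu(vd, vs2, vs1, 6)
```

Each half only ever needs temporaries at twice EMUL/2, which is what keeps an LMUL=8 operation within the largest register group size.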
Once the emulation algorithm is described, you can declare the prototype of the operation you want to emulate and associate it with an emulation sequence:
# declaration of formats
uint_t = NodeFormatDescriptor(NodeFormatType.SCALAR, elt_type, lmul_type=None)
vuintm_t = NodeFormatDescriptor(NodeFormatType.VECTOR, elt_type, lmul)
vbooln_t = NodeFormatDescriptor(NodeFormatType.MASK, elt_type, lmul)
# declaration of operands
lhs = Input(vuintm_t, 0)
rhs = Input(vuintm_t, 1)
vm = Input(vbooln_t, -2, name="vm")
vd = Input(vuintm_t, -1, name="vd")
rhs_vx = Input(uint_t, 1)
(...)
# declaration of vandn.vv prototype
vuintm_vandn_vv_prototype = Operation(
    vuintm_t,
    OperationDescriptor(OperationType.ANDN),
    lhs,
    rhs,
    vl,
    vm=mask,
    tail_policy=tail_policy,
    mask_policy=mask_policy,
    dst=dst
)
# definition of vandn.vv emulation sequence
vuintm_vandn_vv_emulation = and_not(lhs, rhs, vl, vm=mask, dst=dst, tail_policy=tail_policy, mask_policy=mask_policy)
# declaration of vandn.vx (vector-scalar) prototype
vuintm_vandn_vx_prototype = Operation(
    vuintm_t,
    OperationDescriptor(OperationType.ANDN),
    lhs,
    rhs_vx,
    vl,
    vm=mask,
    tail_policy=tail_policy,
    mask_policy=mask_policy,
    dst=dst
)
# definition of vandn.vx emulation sequence
vuintm_vandn_vx_emulation = and_not(lhs, rhs_vx, vl, vm=mask, dst=dst, tail_policy=tail_policy, mask_policy=mask_policy)
Application example(s)
RIE contains two application examples: an 8-bit integer matrix multiply relying on Zvdot4a8i intrinsics (tests/src/test_zvdot4a8i.c), and the 4x4 matrix transpose example listed in the Zvzip draft specification (tests/src/test_zvzip.c) (source for the example).
With the magic of RIE, the following code sequence can be compiled without requiring any compiler support for Zvdot4a8i:
for (int k = 0; k < TEST_SIZE_K; k += 4) {
    vuint32m1_t vlhs = __riscv_vlse32_v_u32m1((uint32_t*)(lhs + i * TEST_SIZE_K + k), TEST_SIZE_K, vl);
    // building right hand side operand
    uint32_t rhs_4elts = 0;
    for (int l = 0; l < 4; l++) {
        rhs_4elts += (uint32_t) (rhs[(k + l) * TEST_SIZE_N + j]) << (l * 8);
    }
    vout = __riscv_vdot4au_vx_u32m1(vout, vlhs, rhs_4elts, vl);
}
Thanks to RIE, it becomes possible to develop and execute those applications on any RVV 1.0 capable implementation (hardware and compiler).
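The data layout used by this loop (four consecutive 8-bit values of a row packed into one 32-bit word, and four values of a column packed into the scalar operand) can be checked with a small scalar model. All names below (pack4, dot4au_vx, matmul_dot4, matmul_naive) are hypothetical and the sketch assumes little-endian packing, a K dimension that is a multiple of 4, and a single strip covering all rows (i = 0, vl = number of rows).

```python
MASK32 = 0xFFFFFFFF


def pack4(values):
    """Pack four 8-bit values into a 32-bit word, first value in the low byte."""
    word = 0
    for l, v in enumerate(values):
        word |= (v & 0xFF) << (8 * l)
    return word


def dot4au_vx(vd, vs2, rs1, vl):
    """Scalar model of the vector-scalar unsigned dot product with accumulation."""
    out = list(vd)
    for r in range(vl):
        acc = vd[r]
        for l in range(4):
            acc += ((vs2[r] >> (8 * l)) & 0xFF) * ((rs1 >> (8 * l)) & 0xFF)
        out[r] = acc & MASK32
    return out


def matmul_dot4(lhs, rhs, m, k_size, n):
    """Matrix multiply structured like the C loop: one output column at a time."""
    out = [[0] * n for _ in range(m)]
    for j in range(n):
        vout = [0] * m  # one 32-bit accumulator per row
        for k in range(0, k_size, 4):
            # models the strided load: one 32-bit word per row, starting at column k
            vlhs = [pack4(lhs[r][k:k + 4]) for r in range(m)]
            # building the right-hand side operand from column j of rhs
            rhs_4elts = pack4([rhs[k + l][j] for l in range(4)])
            vout = dot4au_vx(vout, vlhs, rhs_4elts, m)
        for r in range(m):
            out[r][j] = vout[r]
    return out


def matmul_naive(lhs, rhs, m, k_size, n):
    """Plain triple loop used as the reference result."""
    return [[sum(lhs[r][k] * rhs[k][j] for k in range(k_size)) & MASK32
             for j in range(n)] for r in range(m)]


lhs = [[1, 2, 3, 4], [5, 6, 7, 8]]
rhs = [[1, 0], [0, 1], [2, 0], [0, 3]]
assert matmul_dot4(lhs, rhs, 2, 4, 2) == matmul_naive(lhs, rhs, 2, 4, 2)
```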
Evaluation
The benefit (or “drawback” :-) ) of developing an emulation layer targeting real hardware is that it becomes easy to measure its performance.
Using clang 18.1.3 with -march=rv64gcv and -O2 on a BananaPi-F3 (Spacemit X60 with RVV 1.0), we benchmarked a subset of the intrinsics that RIE already supports. The results are presented below. We first report the number of instructions per intrinsic (this should be an exact integer since there is no control flow in the current emulations). This metric measures the conciseness of RIE emulation algorithms and the compiler's ability to optimize the generated code. Second, we report the average latency in cycles per emulated instruction (measured for VLMAX, although this has no impact on X60 performance). The latency measures the quality of the emulation and the capability of the micro-architecture.
The measurements, in particular the latency, demonstrate that RIE definitely does not run at the speed of a dedicated hardware implementation. As a rule of thumb, the pure hardware latency of the measured instructions can be estimated at around LMUL cycles (as they can certainly be implemented in a fully pipelined datapath). Compared to that estimate, the performance cost of emulation is between one and two orders of magnitude (in base 10). It is likely that some of it can be regained with more optimized emulation sequences, but emulation will always have a non-zero cost (except when the emulated vector instruction is actually an alias of another existing instruction).
Conclusion(s)
Hopefully rvv-intrinsic-emulation (RIE) can be useful to at least two groups of people:
developers of vector extensions who want to develop proofs of concept for extension projects without having to go down to the assembly level (and without requiring toolchain support).
(early) application developers who want to play with future extensions before they become available in hardware or before they are supported by their toolchain.
The second use case certainly becomes moot once standard compilers (e.g. gcc, clang/llvm) implement their own lowering of unsupported instructions.
RIE is open source and available on GitHub: https://github.com/nibrunie/rvv-intrinsic-emulation. Feel free to check it out, modify/extend it and provide any constructive feedback.
With RIE it becomes easier to develop proof-of-concept use cases for new vector extensions. Moreover, these proofs of concept can seamlessly transition to “the real thing” once the extension becomes available in compilers and hardware (since RIE aims to expose the intrinsics following the standard API).
Disclaimer: Google’s Antigravity IDE and Anthropic’s Claude Opus and Sonnet LLMs were used for part of the development of RIE.
Reference(s)
Source repository: https://github.com/nibrunie/rvv-intrinsic-emulation
RVV intrinsics documentation: https://github.com/riscv-non-isa/rvv-intrinsic-doc
Zvzip draft specification: https://github.com/riscv/riscv-isa-manual/pull/2529
Zvdot4a8i draft specification: https://github.com/riscv/riscv-isa-manual/pull/2576
Zvabd draft specification: riscv/integer-vector-absolute-difference/pull/1
[1] rvv-intrinsic-emulation (RIE): the name might change, don’t get too attached.

