Post updated on Jan 10th 2024 to fix an error in the vl setting description, pointed out by -.- (thank you).
In a previous blog post series, RISC-V Vector in a Nutshell, we introduced the basics of the RISC-V Vector extension (RVV). If you are not familiar with this extension, you should consult that series first.
Once you know the basics of RVV, the next step is to try it out. The most basic way is assembly programming. Current GCC and Clang toolchains can easily assemble assembly programs which make use of RVV instructions (as long as support for RVV is enabled, often by listing the v extension in the -march ISA string).
RVV programming with C intrinsics
But there is a somewhat easier and more modern way to program with RISC-V Vector directly in C/C++: RVV intrinsics. RVV instructions can be called within a C/C++ program directly through intrinsics: low-level functions exposed by the compiler. Each of those low-level functions has an almost one-to-one mapping with the corresponding RVV instruction, making low-level RVV programming accessible without assembly expertise.
In short, an intrinsic is a low-level function generally defined directly in a compiler (no need to link a specific library) which exposes a single instruction or a short sequence of instructions into a higher level language (higher level than assembly).
A first example of RVV intrinsic
The following is an example of an RVV intrinsic performing an integer vector addition, vadd.vv, between two vector register groups of one vector register each (m1), with 32-bit elements (i32).
vint32m1_t __riscv_vadd_vv_i32m1(vint32m1_t, vint32m1_t, size_t);
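This intrinsic is used like any other C function; a minimal usage sketch (variable names are ours):
// a and b are vint32m1_t inputs, vl a previously computed vector length
vint32m1_t sum = __riscv_vadd_vv_i32m1(a, b, vl);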
Intrinsic naming follows a regular scheme, summarized in the diagram below: for __riscv_vadd_vv_i32m1, the __riscv_ prefix is followed by the operation (vadd), the operand formats (vv for vector-vector), the element type and width (i32), and the group multiplier (m1).
If you are familiar with RVV you will already have noted that the function name also contains a description of the vector configuration (element size, group multiplier LMUL), which is not generally encoded in an RVV instruction opcode. Moreover, most intrinsics expect a vector length parameter.
This simplifies the programming model: all the information about the vector configuration of an operation is embedded in the intrinsic. This includes the tail and mask policies: the intrinsic suffix encodes this piece of information. For example, no suffix means unmasked and tail agnostic, while _tu means unmasked and tail undisturbed:
vint32m1_t __riscv_vadd_vv_i32m1_tu(vint32m1_t vd, vint32m1_t vs2, vint32m1_t vs1, size_t vl);
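The _tu variant takes an extra first operand, vd, whose elements at indices greater than or equal to vl are carried unchanged into the result. A minimal sketch (function name is ours):
#include <riscv_vector.h>

// tail-undisturbed add: elements of vd past vl are preserved in the result
vint32m1_t add_keep_tail(vint32m1_t vd, vint32m1_t a, vint32m1_t b, size_t vl) {
    return __riscv_vadd_vv_i32m1_tu(vd, a, b, vl);
}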
Embedding all those configuration items puts more burden on the compiler, which has to generate and optimize the sequence of vector configuration instructions (vset*) and vector operations, factoring out redundant local vector configurations when possible.
There are a lot of RVV intrinsics, too many to count. The specification and documentation of the RVV intrinsics is an ongoing effort by RVIA (RISC-V International), with Yueh-Ting (eop) Chen being one of the main contributors. The project can be found here: https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/main.
The intrinsics require compiler support: LLVM 17 and GCC trunk (dev branch) support the latest version, v0.12, of the RVV intrinsics specification.
Note: There exists a very useful RVV intrinsics viewer: https://dzaima.github.io/intrinsics-viewer/ (link suggested by a reader in the comments).
Why so many intrinsics?
Many vector types
First of all, there are many possible types for intrinsic operands and destinations (documented in the Type System section of the intrinsics documentation). These possibilities correspond to the valid combinations of:
- element type (floating-point, signed integer, unsigned integer, boolean, …)
- element width (8, 16, 32, or 64 bits)
- vector group multiplier (1, 2, 4, 8 or the fractional 1/2, 1/4, 1/8)
Here are a few examples:
// type for one-register groups of signed 8-bit integers
vint8m1_t
// type for 4-register groups of unsigned 8-bit integers
vuint8m4_t
// type for 2-register groups of 16-bit floating-point values (half)
vfloat16m2_t
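Fractional group multipliers get their own types as well, following the same scheme, for example:
// type for a fractional group (half of one vector register) of signed 16-bit integers
vint16mf2_t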
One intrinsic for each pair (operation, type)
For each vector operation, one explicit RVV intrinsic is defined for each specific set of input and destination types.
The number of possible signature types creates a very large set of intrinsics for every single RVV instruction. For example, here is a very small subset of intrinsics to perform integer vector addition between two vectors:
// addition of register groups of one vector reg of 32-bit elements
vint32m1_t __riscv_vadd_vv_i32m1(vint32m1_t, vint32m1_t, size_t);
// addition of register groups of two vector regs of 32-bit elements
vint32m2_t __riscv_vadd_vv_i32m2(vint32m2_t, vint32m2_t, size_t);
// addition of register groups of two vector regs of 64-bit elements
vint64m2_t __riscv_vadd_vv_i64m2(vint64m2_t, vint64m2_t, size_t);
All those intrinsics map to a single vector instruction, vadd.vv.
Note: more precisely, most operational intrinsics describe the sequence of a vset* instruction to define the vector configuration and a vector instruction to perform the actual operation. A toy example shows that the compiler (clang trunk in this case) optimizes away the redundant vset* in a sequence of instructions which share the same vector configuration.
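For illustration, here is a minimal sketch of our own (not the post's linked toy example): two dependent additions share the same SEW/LMUL/vl, so the compiler only needs one vsetvli before the two vadd.vv instructions.
#include <riscv_vector.h>

// both intrinsics use e32/m1 and the same vl: a single vsetvli suffices
vint32m1_t add3(vint32m1_t a, vint32m1_t b, vint32m1_t c, size_t vl) {
    vint32m1_t t = __riscv_vadd_vv_i32m1(a, b, vl);
    return __riscv_vadd_vv_i32m1(t, c, vl);
}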
Implicit (overloaded) API
Fortunately the intrinsics API also provides an implicit (overloaded) naming scheme (doc) which allows the programmer to use a single overloaded function (e.g. __riscv_vadd) to call all the EEW/LMUL variants. There are some limitations to this scheme, which can be found here; for example there is no overloaded version of intrinsics with only scalar types, which means there is no overloaded function for the unmasked unit-strided load.
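A short sketch (function names are ours) showing the same overloaded name resolving to different explicit intrinsics:
#include <riscv_vector.h>

// resolves to __riscv_vadd_vv_i32m1
vint32m1_t add_i32m1(vint32m1_t a, vint32m1_t b, size_t vl) {
    return __riscv_vadd(a, b, vl);
}

// the same overloaded name resolves to __riscv_vadd_vv_i64m2
vint64m2_t add_i64m2(vint64m2_t a, vint64m2_t b, size_t vl) {
    return __riscv_vadd(a, b, vl);
}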
One intrinsic for each mask/tail configuration
As we have seen previously, base intrinsics can be extended with an optional suffix indicating whether the operation is masked or unmasked, the policy for inactive elements, and the tail policy. The available suffixes are detailed here.
There are 6 possible suffixes (including the default empty suffix). For example, the following is the intrinsic for a 16-bit element masked unit-strided vector load:
vfloat16mf4_t __riscv_vle16_v_f16mf4_m(vbool64_t vm,
const _Float16 *rs1,
size_t vl);
These suffixes also exist in the implicit naming scheme.
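As another hedged illustration (function name is ours), here is the explicit masked (_m) variant of the 32-bit integer addition used earlier; for i32m1 operands the mask type is vbool32_t:
#include <riscv_vector.h>

// masked add: only elements whose mask bit is set are computed
vint32m1_t masked_add(vbool32_t vm, vint32m1_t vs2, vint32m1_t vs1, size_t vl) {
    return __riscv_vadd_vv_i32m1_m(vm, vs2, vs1, vl);
}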
Are all the intrinsics actual RVV instructions?
Some of the intrinsics do not necessarily map to real RVV instructions. For example selecting a single register out of a multi-register vector group:
vint8m1_t __riscv_vget_v_i8m8_i8m1(vint8m8_t src, size_t index);
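The companion __riscv_vset intrinsic performs the opposite operation, inserting a single register into a multi-register group:
vint8m8_t __riscv_vset_v_i8m1_i8m8(vint8m8_t dest, size_t index, vint8m1_t value);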
Similarly, re-interpreting a vector of unsigned 32-bit integers as a vector of 32-bit single precision elements requires an intrinsic:
vfloat32m1_t __riscv_vreinterpret_v_u32m1_f32m1(vuint32m1_t);
The underlying data does not change within the register group; it is just re-interpreted differently by the next operations. This is due to the fact that the RVV C intrinsics type system distinguishes multiple types of 32-bit elements, which is not the case in assembly: vadd.vv and vfadd.vv can be executed seamlessly on the same inputs, or one on the result of the other, without requiring any extra operation in between (even if the cases where this actually makes sense may be few).
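To see where this matters in practice, here is a minimal sketch of ours that computes a vector absolute value by reinterpreting floats as unsigned integers, clearing the sign bit, and reinterpreting back:
#include <riscv_vector.h>

// |v|: the reinterpret intrinsics generate no instruction,
// only the vand.vx does actual work
vfloat32m1_t vec_fabs(vfloat32m1_t v, size_t vl) {
    vuint32m1_t bits = __riscv_vreinterpret_v_f32m1_u32m1(v);
    bits = __riscv_vand_vx_u32m1(bits, 0x7fffffffu, vl);
    return __riscv_vreinterpret_v_u32m1_f32m1(bits);
}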
Note: In an RVV assembly program, the vector configuration of an instruction (SEW, LMUL, vl) generally depends on its context in the program; in particular, it depends on the previous vector configuration change executed before it in program order. In a C program using RVV intrinsics, the vector configuration is a property of a variable/expression and does not depend on the position of the expression in C program order.
Example: Vector-Add with RVV Intrinsics
Let’s implement the basic vector example, a 32-bit floating-point vector-add, using the intrinsics. We will define the following function:
/** vector addition
*
* @param dst address of destination array
* @param lhs address of left hand side operand array
* @param rhs address of right hand side operand array
* @param avl application vector length (array size)
*/
void vector_add(float *dst,
float *lhs,
float *rhs,
size_t avl);
vector_add performs the element-wise addition of two arrays, lhs and rhs, each with avl single precision (float) elements; the results are stored in the array dst.
void vector_add(float *dst,
float *lhs,
float *rhs,
size_t avl)
{
for (size_t vl; avl > 0; avl -= vl, lhs += vl, rhs += vl, dst += vl)
{
// compute the number of elements which are going to be
// processed in this iteration of loop body.
// this number corresponds to the vector length (vl)
// and is evaluated from avl (application vector length)
vl = __riscv_vsetvl_e32m1(avl);
// loading operands
vfloat32m1_t vec_src_lhs = __riscv_vle32_v_f32m1(lhs, vl);
vfloat32m1_t vec_src_rhs = __riscv_vle32_v_f32m1(rhs, vl);
// actual vector addition
vfloat32m1_t vec_acc = __riscv_vfadd_vv_f32m1(vec_src_lhs,
vec_src_rhs,
vl);
// storing results
__riscv_vse32_v_f32m1(dst, vec_acc, vl);
}
}
The method used here is straightforward:
- The main loop iterates over the input arrays to compute the vector addition of avl elements. avl is used as the counter of remaining elements.
- In each iteration:
  - We stop if we detect there are no more elements to compute (said otherwise, we start a new iteration if and only if avl > 0).
  - We start by computing the number of elements, vl, which will be processed during this iteration: vl = __riscv_vsetvl_e32m1(avl);
  - We load vl elements from both lhs and rhs: vfloat32m1_t vec_src_lhs = __riscv_vle32_v_f32m1(lhs, vl); vfloat32m1_t vec_src_rhs = __riscv_vle32_v_f32m1(rhs, vl);
  - We perform element-wise additions of vl elements: vfloat32m1_t vec_acc = __riscv_vfadd_vv_f32m1(vec_src_lhs, vec_src_rhs, vl);
  - We store the vl results into dst: __riscv_vse32_v_f32m1(dst, vec_acc, vl);
  - We update avl by subtracting vl from it, and we update the source and destination pointers: avl -= vl, lhs += vl, rhs += vl, dst += vl
The likely behavior is depicted by the diagram below: VLMAX elements will be processed in each iteration (VLMAX elements from lhs and rhs will be added to form VLMAX elements in dst), except the last one, which will process either VLMAX elements if the original avl value was a multiple of VLMAX, or avl % VLMAX elements (modulo operation) otherwise.
Note: As we will explain later, the RVV 1.0 specification allows fewer than VLMAX elements to be processed in this case. For simplicity's sake we assume an implementation which processes the maximum legal number of elements in each loop iteration. Legal behaviors include some variance when the remaining number of elements drops strictly below 2 * VLMAX.
Impact of VLEN on the result of vsetvl
Let us come back to the evaluation of the local vector length at the start of each loop iteration:
vl = __riscv_vsetvl_e32m1(avl);
The value returned by __riscv_vsetvl_e32m1 depends on two things: avl, but also VLMAX (which is directly related to VLEN). If avl is greater than VLMAX then a truncated value is returned (the truncated value is less than or equal to VLMAX, c.f. the spec); else avl is returned. RVV 1.0 ensures that vl=0 cannot be returned if avl > 0 (the spec mandates vset*vl* to return vl such that ceil(AVL / 2) <= vl <= VLMAX when VLMAX <= AVL < 2 * VLMAX, and to return VLMAX when AVL >= 2 * VLMAX), so forward progress is ensured, but the actual amount of progress is implementation dependent. For example, with VLMAX=4 and avl=10, the successive calls may legally return vl = 4, 4, 2 or, just as legally, vl = 4, 3, 3. This post assumes only one of many legal RVV 1.0 behaviors: VLMAX is returned by __riscv_vsetvl_e32m1 if AVL >= VLMAX.
VLMAX = VLEN * LMUL / SEW
In our case, SEW=32 (e32) and LMUL=1 (m1), so we get VLMAX = VLEN / 32 (e.g. VLMAX=4 for VLEN=128, VLMAX=16 for VLEN=512). The actual bound on the value returned by vsetvl therefore depends on VLEN: the larger the VLEN, the more elements are computed in each loop iteration. This is the definition of a vector length architecture with a vector length agnostic program: the execution adapts to the actual architectural value of VLEN.
Note: Current compilers emit __riscv_vsetvl_e32m1 as a vsetvli (immediate values for SEW and LMUL) rather than the generic vsetvl, exploiting the fact that both the element width and the group multiplier are known at compile time and can be embedded in the opcode. The RVV intrinsics only offer a generic family of functions, __riscv_vsetvl_<ew><lmul>, which can be compiled to vsetvli or vsetivli depending on the static/dynamic character of the vector length value (the vtype value is always statically encoded in the function name).
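For instance (a hedged sketch of ours), an application vector length that is a small compile-time constant may be compiled to vsetivli, whose 5-bit immediate field can hold the AVL directly:
// the constant AVL (8) fits vsetivli's unsigned immediate, so a compiler
// may emit: vsetivli a0, 8, e32, m1, ta, ma
size_t static_vl(void) {
    return __riscv_vsetvl_e32m1(8);
}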
Benchmarking
Building the code and looking at the assembly
You can easily build the code using a recent compiler on godbolt.org compiler explorer: https://godbolt.org/z/x1q8qvdhr.
You will get the following assembly:
vector_add: # @vector_add
beqz a3, .LBB0_2
.LBB0_1: # =>This Inner Loop Header: Depth=1
vsetvli a4, a3, e32, m1, ta, ma
vle32.v v8, (a1)
vle32.v v9, (a2)
vfadd.vv v8, v8, v9
vse32.v v8, (a0)
sub a3, a3, a4
slli a4, a4, 2
add a1, a1, a4
add a2, a2, a4
add a0, a0, a4
bnez a3, .LBB0_1
.LBB0_2:
ret
This assembly can easily be mapped to our C intrinsics, and the register values at the start of the function are a direct mapping of the ABI specification:
- a0 contains the destination pointer dst
- a1 contains the first source pointer lhs
- a2 contains the second source pointer rhs
- a3 contains avl
Note: for the same example using the implicit (overloaded) functions, you can check out https://godbolt.org/z/vYc3GMGe4
Simple benchmark
We are going to build a very simple benchmark which evaluates and displays how many instructions are executed by our vector_add function. For that purpose we rely on a RISC-V performance counter named instret, which counts the number of retired instructions. This is not a very good way to evaluate a program’s performance, but it will suffice for now.
// file: bench_vector_add.c
#include <stdio.h>
#include <stdlib.h>

// declaration of the vector_add function defined in vector_add_intrinsics.c
void vector_add(float *dst, float *lhs, float *rhs, size_t avl);
/** return the value of the instret counter
*
* The instret counter counts the number of retired (executed) instructions.
*/
unsigned long read_instret(void)
{
unsigned long instret;
asm volatile ("rdinstret %0" : "=r" (instret));
return instret;
}
// Defining a default size for the input and output arrays
// (can be overridden during compilation with -DARRAY_SIZE=<value>)
#ifndef ARRAY_SIZE
#define ARRAY_SIZE 1024
#endif
float lhs[ARRAY_SIZE];
float rhs[ARRAY_SIZE];
float dst[ARRAY_SIZE] = {0.f};
int main(void) {
int i;
// random initialization of the input arrays
for (i = 0; i < ARRAY_SIZE; ++i) {
lhs[i] = rand() / (float) RAND_MAX;
rhs[i] = rand() / (float) RAND_MAX;
}
unsigned long start, stop;
start = read_instret();
vector_add(dst, lhs, rhs, ARRAY_SIZE);
stop = read_instret();
printf("vector_add_intrinsics used %lu instruction(s) to evaluate %d element(s).\n", stop - start, ARRAY_SIZE);
return 0;
}
Building our benchmark
The source files of this example, alongside a Dockerfile to build a simple RISC-V development environment, can be found at https://github.com/nibrunie/rvv-examples/releases/tag/v0.1.0.
We will build our benchmark in two steps; first, the vector_add_intrinsics.c source file:
# building object file with intrinsic function
riscv64-unknown-elf-gcc -O2 -march=rv64gcv -c \
    -o vector_add_intrinsics.o \
    vector_add_intrinsics.c
Then we build our benchmark source file and link it with the object file:
# building benchmark
riscv64-unknown-elf-gcc -march=rv64gcv \
bench_vector_add.c \
vector_add_intrinsics.o \
-O2 -o bench-0_vector_add
Running and comparing VLEN values
We are going to use Spike, the RISC-V Instruction Set Simulator (ISS), to run our program.
Spike supports most RISC-V extensions (including RVV) and is highly configurable when it comes to RISC-V architectural parameters. In particular, RVV's VLEN and ELEN can be configured:
spike --isa=rv64gcv_zicntr_zihpm --varch=vlen:128,elen:64 /opt/riscv/riscv64-unknown-elf/bin/pk64 bench-0_vector_add
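Comparing VLEN values only requires changing the --varch option; the same command as above, shown here with a hypothetical VLEN of 512 bits:
spike --isa=rv64gcv_zicntr_zihpm --varch=vlen:512,elen:64 /opt/riscv/riscv64-unknown-elf/bin/pk64 bench-0_vector_add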
It becomes easy to measure the number of retired instructions for different values of VLEN. The result is plotted below:
This is one of the advantages of RVV (and other vector ISAs): a single binary program can be executed by implementations with different values of VLEN. The results are identical but the executions differ, for example in the number of retired instructions.
Note: Implementors can choose different architectural parameter values depending on the metrics they want to optimize for (a larger VLEN implies a wider vector register file and thus a larger silicon area cost).
We can see that when VLEN increases, the number of "executed" instructions decreases. This was expected: the number of iterations of our loop depends on VLMAX, which in turn depends on VLEN. The larger the VLEN, the larger the vector length value returned by the vsetvli instruction, and the more elements are loaded/added/stored by a single vle32/vfadd/vse32.
For VLEN=4096, VLMAX = 4096/32 = 128 (since we manipulate 32-bit single precision elements). Theoretically only 1024/128 = 8 iterations of our loop body are required to produce the full 1024 elements of the dst array. The assembly shows 11 instructions in the loop body, and 11*8 = 88, so we are not too far off from the 97 instructions retired during the benchmark execution for VLEN=4096.
We can run a similar benchmark with various values of LMUL; this requires some source code modifications. For example, the LMUL=4 implementation looks like:
void vector_add(float *dst,
float *lhs,
float *rhs,
size_t avl)
{
for (size_t vl; avl > 0; avl -= vl, lhs += vl, rhs += vl, dst += vl)
{
// compute loop body vector length from avl
// (application vector length)
vl = __riscv_vsetvl_e32m4(avl);
// loading operands
vfloat32m4_t vec_src_lhs = __riscv_vle32_v_f32m4(lhs, vl);
vfloat32m4_t vec_src_rhs = __riscv_vle32_v_f32m4(rhs, vl);
// actual vector addition
vfloat32m4_t vec_acc = __riscv_vfadd_vv_f32m4(vec_src_lhs,
vec_src_rhs,
vl);
// storing results
__riscv_vse32_v_f32m4(dst, vec_acc, vl);
}
}
An interesting fact is that the number of retired instructions for VLEN=512, LMUL=1 is exactly equal to the number of retired instructions for VLEN=128, LMUL=4 (713 instructions in both cases). This is no coincidence: the values of VLMAX in both cases are equal (512 * 1 / 32 = 16 and 128 * 4 / 32 = 16). The loop code being identical, the numbers of retired instructions match.
Conclusion
RVV intrinsics offer a higher level API to program with RISC-V Vector instructions (compared to assembly programming): the developer has access to the C/C++ type system and to the optimizing capabilities of modern compilers (including instruction selection, scheduling, and register allocation). The ongoing specification effort includes extensive documentation, and support is available in recent versions of compilers (LLVM and GCC), making the intrinsics a great tool to access RVV programming.
Note on Jan 10th 2024 update: A previous version of this post stated:
if avl is greater than VLMAX then VLMAX is returned else avl is returned.
As pointed out by -.- in a comment, this was incorrect, as the RVV spec allows some implementation freedom there (see RVV 1.0 specification Section 6.3: Constraints on setting vl).
References
- Compiler explorer link with the vector-add function: https://godbolt.org/z/x1q8qvdhr
- Source snapshot of rvv-examples with basic vector add benchmark: https://github.com/nibrunie/rvv-examples/releases/tag/v0.1.0
- GitHub repository with the simple vector-add example: https://github.com/nibrunie/rvv-examples/tree/main/src/vector_add
- Intrinsics viewer: https://dzaima.github.io/intrinsics-viewer/ (suggested by a reader in the comments)