Post updated on Jan 10th 2024 to fix an error in the vl setting description, pointed out by -.- (thank you).
In a previous blog post series, RISC-V Vector in a Nutshell, we introduced the basics of the RISC-V Vector extension (RVV). If you are not familiar with this extension, you should consult that series first.
Once you know the basics of RVV, the next step is to try it out. The most basic way is assembly programming. Current GCC and Clang toolchains can easily assemble assembly programs which make use of RVV instructions (as long as support for RVV is enabled, often by listing the v extension in the -march ISA string).
RVV programming with C intrinsics
But there is a somewhat easier and more modern way to program with RISC-V Vector directly in C/C++: RVV intrinsics. RVV instructions can be called within a C/C++ program directly through intrinsics: low-level functions exposed by the compiler. Each of those low-level functions has an almost one-to-one mapping with the corresponding RVV instruction, making low-level RVV programming accessible without assembly expertise.
In short, an intrinsic is a low-level function generally defined directly in a compiler (no need to link a specific library) which exposes a single instruction or a short sequence of instructions into a higher level language (higher level than assembly).
A first example of RVV intrinsic
The following is an example of an RVV intrinsic performing an integer vector addition, vadd.vv, between two vector register groups of one vector register each (m1), with 32-bit elements (i32).
vint32m1_t __riscv_vadd_vv_i32m1(vint32m1_t, vint32m1_t, size_t);
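This intrinsic is used like any other C function; a minimal usage sketch (variable names are ours):
// a and b are vint32m1_t inputs, vl a previously computed vector length
vint32m1_t sum = __riscv_vadd_vv_i32m1(a, b, vl);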
Intrinsic naming follows a regular scheme, summarized in the diagram below: for __riscv_vadd_vv_i32m1, the __riscv_ prefix is followed by the operation (vadd), the operand formats (vv for vector-vector), the element type and width (i32), and the group multiplier (m1).
If you are familiar with RVV you will already have noted that the function name also contains a description of the vector configuration (element size, group multiplier LMUL), which is not generally encoded in an RVV instruction opcode. Moreover, most intrinsics expect a vector length parameter.
This simplifies the programming model: all the information about the vector configuration of an operation is embedded in the intrinsic. This includes the tail and mask policies: the intrinsic suffix encodes this piece of information. For example, no suffix means unmasked and tail agnostic, while _tu means unmasked and tail undisturbed:
vint32m1_t __riscv_vadd_vv_i32m1_tu(vint32m1_t vd, vint32m1_t vs2, vint32m1_t vs1, size_t vl);
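The _tu variant takes an extra first operand, vd, whose elements at indices greater than or equal to vl are carried unchanged into the result. A minimal sketch (function name is ours):
#include <riscv_vector.h>

// tail-undisturbed add: elements of vd past vl are preserved in the result
vint32m1_t add_keep_tail(vint32m1_t vd, vint32m1_t a, vint32m1_t b, size_t vl) {
    return __riscv_vadd_vv_i32m1_tu(vd, a, b, vl);
}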
Embedding all those configuration items puts more burden on the compiler, which has to generate and optimize the sequence of vector configuration instructions (vset*) and vector operations, factoring out redundant local vector configurations when possible.
There are a lot of RVV intrinsics, too many to count. The specification and documentation of the RVV intrinsics is an ongoing effort by RVIA (RISC-V International), with Yueh-Ting (eop) Chen being one of the main contributors. The project can be found here: https://github.com/riscv-non-isa/rvv-intrinsic-doc/tree/main.
The intrinsics require compiler support: LLVM 17 and GCC trunk (dev branch) support the latest version, v0.12, of the RVV intrinsics specification.
Note: There exists a very useful RVV intrinsics viewer: https://dzaima.github.io/intrinsics-viewer/ (link suggested by a reader in the comments).
Why so many intrinsics?
Many vector types
First of all, there are many possible types for intrinsic operands and destinations (documented in the Type System section of the intrinsics documentation). These possibilities correspond to the valid combinations of:
- element type (floating-point, signed integer, unsigned integer, boolean, …)
- element width (8, 16, 32, or 64 bits)
- vector group multiplier (1, 2, 4, 8 or the fractional 1/2, 1/4, 1/8)
Here are a few examples:
// type for one-register groups of signed 8-bit integers
vint8m1_t
// type for 4-register groups of unsigned 8-bit integers
vuint8m4_t
// type for 2-register groups of 16-bit floating-point values (half)
vfloat16m2_t
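Fractional group multipliers get their own types as well, following the same scheme, for example:
// type for a fractional group (half of one vector register) of signed 16-bit integers
vint16mf2_t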
One intrinsic for each pair (operation, type)
For each vector operation, one explicit RVV intrinsic is defined for each specific set of input and destination types.
The number of possible signature types creates a very large set of intrinsics for every single RVV instruction. For example, here is a very small subset of intrinsics to perform integer vector addition between two vectors:
// addition of register groups of one vector reg of 32-bit elements
vint32m1_t __riscv_vadd_vv_i32m1(vint32m1_t, vint32m1_t, size_t);
// addition of register groups of two vector regs of 32-bit elements
vint32m2_t __riscv_vadd_vv_i32m2(vint32m2_t, vint32m2_t, size_t);
// addition of register groups of two vector regs of 64-bit elements
vint64m2_t __riscv_vadd_vv_i64m2(vint64m2_t, vint64m2_t, size_t);
All those intrinsics map to a single vector instruction, vadd.vv.
Note: more precisely, most operational intrinsics describe the sequence of a vset* instruction to define the vector configuration and a vector instruction to perform the actual operation. A toy example shows that the compiler (clang trunk in this case) optimizes away the redundant vset* in a sequence of instructions which share the same vector configuration.
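For illustration, here is a minimal sketch of our own (not the post's linked toy example): two dependent additions share the same SEW/LMUL/vl, so the compiler only needs one vsetvli before the two vadd.vv instructions.
#include <riscv_vector.h>

// both intrinsics use e32/m1 and the same vl: a single vsetvli suffices
vint32m1_t add3(vint32m1_t a, vint32m1_t b, vint32m1_t c, size_t vl) {
    vint32m1_t t = __riscv_vadd_vv_i32m1(a, b, vl);
    return __riscv_vadd_vv_i32m1(t, c, vl);
}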
Implicit (overloaded) API
Fortunately the intrinsics API also provides an implicit (overloaded) naming scheme (doc) which allows the programmer to use a single overloaded function (e.g. __riscv_vadd) to call all the EEW/LMUL variants. There are some limitations to this scheme, which can be found here; for example there is no overloaded version of intrinsics with only scalar types, which means there is no overloaded function for the unmasked unit-strided load.
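A short sketch (function names are ours) showing the same overloaded name resolving to different explicit intrinsics:
#include <riscv_vector.h>

// resolves to __riscv_vadd_vv_i32m1
vint32m1_t add_i32m1(vint32m1_t a, vint32m1_t b, size_t vl) {
    return __riscv_vadd(a, b, vl);
}

// the same overloaded name resolves to __riscv_vadd_vv_i64m2
vint64m2_t add_i64m2(vint64m2_t a, vint64m2_t b, size_t vl) {
    return __riscv_vadd(a, b, vl);
}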
One intrinsic for each mask/tail configuration
As we have seen previously, base intrinsics can be extended with an optional suffix indicating whether the operation is masked or unmasked, the policy for inactive elements, and the tail policy. The available suffixes are detailed here.
There are 6 possible suffixes (including the default empty suffix). For example, the following is the intrinsic for a 16-bit element masked unit-strided vector load:
vfloat16mf4_t __riscv_vle16_v_f16mf4_m(vbool64_t vm,
const _Float16 *rs1,
size_t vl);
These suffixes also exist in the implicit naming scheme.
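As another hedged illustration (function name is ours), here is the explicit masked (_m) variant of the 32-bit integer addition used earlier; for i32m1 operands the mask type is vbool32_t:
#include <riscv_vector.h>

// masked add: only elements whose mask bit is set are computed
vint32m1_t masked_add(vbool32_t vm, vint32m1_t vs2, vint32m1_t vs1, size_t vl) {
    return __riscv_vadd_vv_i32m1_m(vm, vs2, vs1, vl);
}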
Are all the intrinsics actual RVV instructions?
Some of the intrinsics do not necessarily map to real RVV instructions. For example selecting a single register out of a multi-register vector group:
vint8m1_t __riscv_vget_v_i8m8_i8m1(vint8m8_t src, size_t index);
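The companion __riscv_vset intrinsic performs the opposite operation, inserting a single register into a multi-register group:
vint8m8_t __riscv_vset_v_i8m1_i8m8(vint8m8_t dest, size_t index, vint8m1_t value);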
Similarly, re-interpreting a vector of unsigned 32-bit integers as a vector of 32-bit single precision elements requires an intrinsic:
vfloat32m1_t __riscv_vreinterpret_v_u32m1_f32m1(vuint32m1_t);
The underlying data does not change within the register group; it is just re-interpreted differently by the next operations. This is due to the fact that the RVV C intrinsics type system distinguishes multiple types of 32-bit elements, which is not the case in assembly: vadd.vv and vfadd.vv can be executed seamlessly on the same inputs, or one on the result of the other, without requiring any extra operation in between (even if the cases where this actually makes sense may be few).
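To see where this matters in practice, here is a minimal sketch of ours that computes a vector absolute value by reinterpreting floats as unsigned integers, clearing the sign bit, and reinterpreting back:
#include <riscv_vector.h>

// |v|: the reinterpret intrinsics generate no instruction,
// only the vand.vx does actual work
vfloat32m1_t vec_fabs(vfloat32m1_t v, size_t vl) {
    vuint32m1_t bits = __riscv_vreinterpret_v_f32m1_u32m1(v);
    bits = __riscv_vand_vx_u32m1(bits, 0x7fffffffu, vl);
    return __riscv_vreinterpret_v_u32m1_f32m1(bits);
}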
Note: In an RVV assembly program, the vector configuration of an instruction (SEW, LMUL, vl) generally depends on its context in the program; in particular, it depends on the previous vector configuration change executed before it in program order. In a C program using RVV intrinsics, the vector configuration is a property of a variable/expression and does not depend on the position of the expression in C program order.
Example: Vector-Add with RVV Intrinsics
Let’s implement the basic vector example, a 32-bit floating-point vector-add, using the intrinsics. We will define the following function:
/** vector addition
*
* @param dst address of destination array
* @param lhs address of left hand side operand array
* @param rhs address of right hand side operand array
* @param avl application vector length (array size)
*/
void vector_add(float *dst,
float *lhs,
float *rhs,
size_t avl);
vector_add performs the element-wise addition of two arrays, lhs and rhs, each with avl single precision (float) elements; the results are stored in the array dst.
void vector_add(float *dst,
float *lhs,
float *rhs,
size_t avl)
{
for (size_t vl; avl > 0; avl -= vl, lhs += vl, rhs += vl, dst += vl)
{
// compute the number of elements which are going to be
// processed in this iteration of loop body.
// this number corresponds to the vector length (vl)
// and is evaluated from avl (application vector length)
vl = __riscv_vsetvl_e32m1(avl);
// loading operands
vfloat32m1_t vec_src_lhs = __riscv_vle32_v_f32m1(lhs, vl);
vfloat32m1_t vec_src_rhs = __riscv_vle32_v_f32m1(rhs, vl);
// actual vector addition
vfloat32m1_t vec_acc = __riscv_vfadd_vv_f32m1(vec_src_lhs,
vec_src_rhs,
vl);
// storing results
__riscv_vse32_v_f32m1(dst, vec_acc, vl);
}
}
The method used here is straightforward:
- The main loop iterates over the input arrays to compute the vector addition of avl elements. avl is used as the counter of remaining elements.
- In each iteration:
  - We stop if we detect there are no more elements to compute (said otherwise, we start a new iteration if and only if avl > 0).
  - We start by computing the number of elements, vl, which will be processed during this iteration: vl = __riscv_vsetvl_e32m1(avl);
  - We load vl elements from both lhs and rhs: vfloat32m1_t vec_src_lhs = __riscv_vle32_v_f32m1(lhs, vl); vfloat32m1_t vec_src_rhs = __riscv_vle32_v_f32m1(rhs, vl);
  - We perform element-wise additions of vl elements: vfloat32m1_t vec_acc = __riscv_vfadd_vv_f32m1(vec_src_lhs, vec_src_rhs, vl);
  - We store the vl results into dst: __riscv_vse32_v_f32m1(dst, vec_acc, vl);
  - We update avl by subtracting vl from it, and we update the source and destination pointers: avl -= vl, lhs += vl, rhs += vl, dst += vl
The likely behavior is depicted by the diagram below: VLMAX elements will be processed in each iteration (VLMAX elements from lhs and rhs will be added to form VLMAX elements in dst), except the last one, which will process either VLMAX elements if the original avl value was a multiple of VLMAX, or avl % VLMAX elements (modulo operation) otherwise.
Note: As we will explain later, the RVV 1.0 specification allows fewer than VLMAX elements to be processed in this case. For simplicity's sake we assume an implementation which processes the maximum legal number of elements in each loop iteration. Legal behaviors include some variance when the remaining number of elements drops strictly below 2 * VLMAX.
Impact of VLEN on the result of vsetvl
Let us come back to the evaluation of the local vector length at the start of each loop iteration:
vl = __riscv_vsetvl_e32m1(avl);
The value returned by __riscv_vsetvl_e32m1 depends on two things: avl, but also VLMAX (which is directly related to VLEN). If avl is greater than VLMAX then a truncated value is returned (the truncated value is less than or equal to VLMAX, c.f. the spec); else avl is returned. RVV 1.0 ensures that vl=0 cannot be returned if avl > 0 (the spec mandates vset*vl* to return vl such that ceil(AVL / 2) <= vl <= VLMAX when VLMAX <= AVL < 2 * VLMAX, and to return VLMAX when AVL >= 2 * VLMAX), so forward progress is ensured, but the actual amount of progress is implementation dependent. For example, with VLMAX=4 and avl=10, the successive calls may legally return vl = 4, 4, 2 or, just as legally, vl = 4, 3, 3. This post assumes only one of many legal RVV 1.0 behaviors: VLMAX is returned by __riscv_vsetvl_e32m1 if AVL >= VLMAX.
VLMAX = VLEN * LMUL / SEW
In our case, SEW=32 (e32) and LMUL=1 (m1), so we get VLMAX = VLEN / 32 (e.g. VLMAX=4 for VLEN=128, VLMAX=16 for VLEN=512). The actual bound on the value returned by vsetvl therefore depends on VLEN: the larger the VLEN, the more elements are computed in each loop iteration. This is the definition of a vector length architecture with a vector length agnostic program: the execution adapts to the actual architectural value of VLEN.
Note: Current compilers emit __riscv_vsetvl_e32m1 as a vsetvli (immediate values for SEW and LMUL) rather than the generic vsetvl, exploiting the fact that both the element width and the group multiplier are known at compile time and can be embedded in the opcode. The RVV intrinsics only offer a generic family of functions, __riscv_vsetvl_<ew><lmul>, which can be compiled to vsetvli or vsetivli depending on the static/dynamic character of the vector length value (the vtype value is always statically encoded in the function name).
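For instance (a hedged sketch of ours), an application vector length that is a small compile-time constant may be compiled to vsetivli, whose 5-bit immediate field can hold the AVL directly:
// the constant AVL (8) fits vsetivli's unsigned immediate, so a compiler
// may emit: vsetivli a0, 8, e32, m1, ta, ma
size_t static_vl(void) {
    return __riscv_vsetvl_e32m1(8);
}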
Benchmarking
Building the code and looking at the assembly
You can easily build the code using a recent compiler on godbolt.org compiler explorer: https://godbolt.org/z/x1q8qvdhr.
You will get the following assembly:
vector_add: # @vector_add
beqz a3, .LBB0_2
.LBB0_1: # =>This Inner Loop Header: Depth=1
vsetvli a4, a3, e32, m1, ta, ma
vle32.v v8, (a1)
vle32.v v9, (a2)
vfadd.vv v8, v8, v9
vse32.v v8, (a0)
sub a3, a3, a4
slli a4, a4, 2
add a1, a1, a4
add a2, a2, a4
add a0, a0, a4
bnez a3, .LBB0_1
.LBB0_2:
ret
This assembly can easily be mapped to our C intrinsics, and the register values at the start of the function are a direct mapping of the ABI specification:
- a0 contains the destination pointer dst
- a1 contains the first source pointer lhs
- a2 contains the second source pointer rhs
- a3 contains avl
Note: for the same example using the implicit (overloaded) functions, you can check out https://godbolt.org/z/vYc3GMGe4
Simple benchmark
We are going to build a very simple benchmark which evaluates and displays how many instructions are executed by our vector_add function. For that purpose we rely on a RISC-V performance counter named instret, which counts the number of retired instructions. This is not a very good way to evaluate a program’s performance, but it will suffice for now.
// file: bench_vector_add.c
#include <stdio.h>
#include <stdlib.h>

// declaration of the vector_add function defined in vector_add_intrinsics.c
void vector_add(float *dst, float *lhs, float *rhs, size_t avl);
/** return the value of the instret counter
*
* The instret counter counts the number of retired (executed) instructions.
*/
unsigned long read_instret(void)
{
unsigned long instret;
asm volatile ("rdinstret %0" : "=r" (instret));
return instret;
}
// Defining a default size for the input and output arrays
// (can be overridden during compilation with -DARRAY_SIZE=<value>)
#ifndef ARRAY_SIZE
#define ARRAY_SIZE 1024
#endif
float lhs[ARRAY_SIZE];
float rhs[ARRAY_SIZE];
float dst[ARRAY_SIZE] = {0.f};
int main(void) {
int i;
// random initialization of the input arrays
for (i = 0; i < ARRAY_SIZE; ++i) {
lhs[i] = rand() / (float) RAND_MAX;
rhs[i] = rand() / (float) RAND_MAX;
}
unsigned long start, stop;
start = read_instret();
vector_add(dst, lhs, rhs, ARRAY_SIZE);
stop = read_instret();
printf("vector_add_intrinsics used %lu instruction(s) to evaluate %d element(s).\n", stop - start, ARRAY_SIZE);
return 0;
}
Building our benchmark
The source files of this example, alongside a Dockerfile to build a simple RISC-V development environment, can be found at https://github.com/nibrunie/rvv-examples/releases/tag/v0.1.0.
We will build our benchmark in two steps; first, the vector_add_intrinsics.c source file:
# building object file with intrinsic function
riscv64-unknown-elf-gcc -O2 -march=rv64gcv -c \
    -o vector_add_intrinsics.o \
    vector_add_intrinsics.c
Then we build our benchmark source file and link it with the object file:
# building benchmark
riscv64-unknown-elf-gcc -march=rv64gcv \
bench_vector_add.c \
vector_add_intrinsics.o \
-O2 -o bench-0_vector_add
Running and comparing VLEN values
We are going to use Spike, the RISC-V Instruction Set Simulator (ISS), to run our program.
Spike supports most RISC-V extensions (including RVV) and is highly configurable when it comes to RISC-V architectural parameters. In particular, RVV's VLEN and ELEN can be configured:
spike --isa=rv64gcv_zicntr_zihpm --varch=vlen:128,elen:64 /opt/riscv/riscv64-unknown-elf/bin/pk64 bench-0_vector_add
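Comparing VLEN values only requires changing the --varch option; the same command as above, shown here with a hypothetical VLEN of 512 bits:
spike --isa=rv64gcv_zicntr_zihpm --varch=vlen:512,elen:64 /opt/riscv/riscv64-unknown-elf/bin/pk64 bench-0_vector_add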
It becomes easy to measure the number of retired instructions for different values of VLEN. The result is plotted below:
This is one of the advantages of RVV (and other vector ISAs): a single binary program can be executed by implementations with different values of VLEN. The results are identical but the executions differ, for example in the number of retired instructions.
Note: Implementors can choose different architectural parameter values depending on the metrics they want to optimize for (a larger VLEN implies a wider vector register file and thus a larger silicon area cost).
We can see that when VLEN increases, the number of "executed" instructions decreases. This was expected: the number of iterations of our loop depends on VLMAX, which in turn depends on VLEN. The larger the VLEN, the larger the vector length value returned by the vsetvli instruction, and the more elements are loaded/added/stored by a single vle32/vfadd/vse32.
For VLEN=4096, VLMAX = 4096/32 = 128 (since we manipulate 32-bit single precision elements). Theoretically only 1024/128 = 8 iterations of our loop body are required to produce the full 1024 elements of the dst array. The assembly shows 11 instructions in the loop body, and 11*8 = 88, so we are not too far off from the 97 instructions retired during the benchmark execution for VLEN=4096.
We can run a similar benchmark with various values of LMUL; this requires some source code modifications. For example, the LMUL=4 implementation looks like:
void vector_add(float *dst,
float *lhs,
float *rhs,
size_t avl)
{
for (size_t vl; avl > 0; avl -= vl, lhs += vl, rhs += vl, dst += vl)
{
// compute loop body vector length from avl
// (application vector length)
vl = __riscv_vsetvl_e32m4(avl);
// loading operands
vfloat32m4_t vec_src_lhs = __riscv_vle32_v_f32m4(lhs, vl);
vfloat32m4_t vec_src_rhs = __riscv_vle32_v_f32m4(rhs, vl);
// actual vector addition
vfloat32m4_t vec_acc = __riscv_vfadd_vv_f32m4(vec_src_lhs,
vec_src_rhs,
vl);
// storing results
__riscv_vse32_v_f32m4(dst, vec_acc, vl);
}
}
An interesting fact is that the number of retired instructions for VLEN=512, LMUL=1 is exactly equal to the number of retired instructions for VLEN=128, LMUL=4 (713 instructions in both cases). This is no coincidence: the values of VLMAX in both cases are equal (512 * 1 / 32 = 16 and 128 * 4 / 32 = 16). The loop code being identical, the numbers of retired instructions match.
Conclusion
RVV intrinsics offer a higher level API to program with RISC-V Vector instructions (compared to assembly programming): the developer has access to the C/C++ type system and to the optimizing capabilities of modern compilers (including instruction selection, scheduling, and register allocation). The ongoing specification effort includes extensive documentation, and support is available in recent versions of compilers (LLVM and GCC), making the intrinsics a great tool to access RVV programming.
Note on Jan 10th 2024 update: A previous version of this post stated:
if avl is greater than VLMAX then VLMAX is returned else avl is returned.
As pointed out by -.- in a comment, this was incorrect, as the RVV spec allows some implementation freedom there (see RVV 1.0 specification Section 6.3: Constraints on setting vl).
References
- Compiler explorer link with the vector-add function: https://godbolt.org/z/x1q8qvdhr
- Source snapshot of rvv-examples with basic vector add benchmark: https://github.com/nibrunie/rvv-examples/releases/tag/v0.1.0
- GitHub repository with the simple vector-add example: https://github.com/nibrunie/rvv-examples/tree/main/src/vector_add
- Intrinsics viewer: https://dzaima.github.io/intrinsics-viewer/ (suggested by a reader in the comments)