RISC-V Vector extension in a nutshell (Part 5.2): vector loads and stores

Sep 26, 2022

This article is part 5.2 of a series on the RISC-V vector extension. It continues and concludes the sub-series on vector memory operations.

Part 1 introduces the series with a general overview, Part 2 reviews arithmetic operations, Part 3 surveys operations with or on masks, and Part 4 describes permute instructions. Part 5.1 introduced memory operations; this post continues that survey.

Indexed memory operations

https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#76-vector-indexed-instructions

The indexed memory operations are the memory equivalent of the vrgather family of instructions: they implement load gathers and store scatters. Contrary to the segmented loads and stores presented in Part 5.1, the indexed memory operations do not assume a regular memory layout. This lack of regularity often means they exhibit lower throughput and higher latency than other memory operations.

There exist two variants of the indexed operations: unordered and ordered. While unordered accesses can appear to be executed in any order, ordered instructions "preserve element ordering on memory accesses". The former lets the micro-architecture re-order accesses, for example to favor access coalescing, even if this re-ordering is visible, while the latter forces the micro-architecture to ensure that everything eventually happens as if the accesses were executed in element order.

Indexed vector loads and stores read the base memory address from the rs1 operand and the index vector from the vs2 operand. Indexed stores read the data from the vs3 operand and indexed loads write the result to the vd operand. These instructions use two different element widths: one for the data, read from the global vtype SEW configuration, and one for the indices, encoded in the opcode.
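To make the gather/scatter semantics concrete, here is a minimal scalar sketch in Python. It is only an illustrative model, not how any implementation works: the flat bytearray memory, the function names and the `order` parameter are assumptions of this sketch. It treats each index as an unsigned byte offset added to the base address (it is not scaled by SEW), and it uses a scatter with a duplicated index to show why the ordered/unordered distinction can be visible.

```python
# Scalar reference model of RVV indexed loads/stores (illustrative only).
# Assumptions: memory is a flat bytearray, data elements are little-endian
# unsigned integers of sew_bytes each, and indices are unsigned byte offsets.

def indexed_load(mem, base, indices, sew_bytes, vl):
    """Gather: vd[i] = mem[base + indices[i]] for i in [0, vl)."""
    vd = []
    for i in range(vl):
        addr = base + indices[i]  # byte offset, not scaled by SEW
        vd.append(int.from_bytes(mem[addr:addr + sew_bytes], "little"))
    return vd


def indexed_store(mem, base, indices, data, sew_bytes, vl, order=None):
    """Scatter: mem[base + indices[i]] = vs3[i].

    `order` (hypothetical parameter) is the order in which element writes
    become visible. The default, element order, models the ordered variant;
    any other permutation is one possible outcome of the unordered variant.
    """
    for i in (order if order is not None else range(vl)):
        addr = base + indices[i]
        mem[addr:addr + sew_bytes] = data[i].to_bytes(sew_bytes, "little")


# Elements 0 and 2 both target byte offset 8.
ordered_mem = bytearray(32)
indexed_store(ordered_mem, 0, [8, 0, 8], [0x11, 0x22, 0x33], 4, 3)
gathered = indexed_load(ordered_mem, 0, [0, 8], 4, 2)

# One possible unordered outcome: element 0 is written last.
unordered_mem = bytearray(32)
indexed_store(unordered_mem, 0, [8, 0, 8], [0x11, 0x22, 0x33], 4, 3,
              order=[2, 1, 0])
reordered = indexed_load(unordered_mem, 0, [0, 8], 4, 2)
```

With element ordering, the last element targeting offset 8 wins (0x33). In the reordered run, the value left at offset 8 is 0x11 instead.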

Whole register instructions

https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#79-vector-loadstore-whole-register-instructions

There exist load and store versions of the whole vector register move operations (vmv<n>r): whole vector register loads vl<n>re<ew> and whole vector register stores vs<n>r. As for whole register moves, the number of registers in the group (EMUL) is encoded in the instruction opcode, through the nf field which encodes an NFIELDS value; the supported values are 1, 2, 4 and 8. The register group must be aligned like any other register group (e.g. v0 is a valid destination for vl8re16 but v4 is not).
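The alignment rule and the amount of data moved can be summarized with a small Python sketch. The helper name `whole_register_access` and its parameters are hypothetical and only serve the illustration. It assumes each register holds VLEN bits, so a group of n registers transfers n * VLEN / 8 bytes, independently of the element width.

```python
def whole_register_access(first_reg, nreg, vlen_bits):
    """Return (registers touched, bytes transferred) for a whole register
    load or store of `nreg` registers starting at v<first_reg>.

    Hypothetical helper: it only models the register-group constraints.
    """
    if nreg not in (1, 2, 4, 8):
        raise ValueError("the number of registers must be 1, 2, 4 or 8")
    if not 0 <= first_reg < 32:
        raise ValueError("vector registers are v0 to v31")
    if first_reg % nreg != 0:
        raise ValueError(f"v{first_reg} is not aligned to a group of {nreg}")
    regs = list(range(first_reg, first_reg + nreg))
    return regs, nreg * vlen_bits // 8


# A group of 8 registers starting at v0 with VLEN=128: 8 registers, 128 bytes.
regs, nbytes = whole_register_access(0, 8, 128)

# v4 cannot start a group of 8 registers.
try:
    whole_register_access(4, 8, 128)
    v4_accepted = True
except ValueError:
    v4_accepted = False
```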

NOTE: The effective element width EW encoded in the opcode is redundant (the memory layout and the number of bytes stored/loaded are independent of the element width) but was added for implementations which optimize vector data layout based on element width: it can be used as a hint to optimize future vector register uses. This hint is only provided for whole register vector loads.

Managing faults during vector memory operations

Before diving into the fault-only-first loads, let us review how vstart can be used to execute a vector memory operation in multiple passes without duplicate work (e.g. this can be particularly useful to ensure forward progress or to avoid accessing memory regions with side effects multiple times).

Usage of vstart by vector memory operations

https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#37-vector-start-index-csr-vstart

If a trap occurs during the execution of a vector operation, the operation can be partially executed and then resumed after the trap cause has been handled. To that end, RVV specifies a vstart configuration register, which is particularly useful for memory operations.

NOTE: It is possible for most arithmetic instructions not to set vstart on a trap and rather to complete execution before jumping to the trap handler. Thus they do not have to support non-zero vstart values.

When a vector load or vector store triggers an exception at the i-th element (e.g. the address is not mapped), it performs the operation for the elements vstart to (i-1) and sets vstart to i. That way, once the exception cause is handled (e.g. the memory address translation is set up), the operation can be restarted from the faulting element, and not from the beginning.
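The resume mechanism can be sketched with a small Python model. Everything here is an assumption made for illustration: the `Hart` and `PageFault` classes, the set of mapped addresses standing in for page tables, and the "handler" that simply maps the faulting byte. The access log shows that no element is accessed twice across the two passes.

```python
class PageFault(Exception):
    """Hypothetical stand-in for a precise trap raised by one element access."""


class Hart:
    """Minimal hart state: the vstart CSR and a log of element accesses."""

    def __init__(self):
        self.vstart = 0
        self.accesses = []


def unit_stride_load(hart, mem, mapped, base, vd, vl):
    """Byte-wide unit-stride load that honours and updates vstart."""
    for i in range(hart.vstart, vl):
        addr = base + i
        if addr not in mapped:
            hart.vstart = i       # elements vstart..i-1 are done, resume at i
            raise PageFault(addr)
        hart.accesses.append(i)
        vd[i] = mem[addr]
    hart.vstart = 0               # completed: vstart goes back to zero


def load_with_trap_handler(hart, mem, mapped, base, vl):
    """Re-execute the load after each trap until it completes."""
    vd = [None] * vl
    attempts = 0
    while True:
        attempts += 1
        try:
            unit_stride_load(hart, mem, mapped, base, vd, vl)
            return vd, attempts
        except PageFault as fault:
            mapped.add(fault.args[0])  # "handler": map the faulting address


hart = Hart()
memory = bytes(range(16))
mapped = {0, 1, 2, 3, 5, 6, 7}    # byte 4 is initially not mapped
result, attempts = load_with_trap_handler(hart, memory, mapped, 0, 8)
```

The first pass loads elements 0 to 3 and traps on element 4 with vstart set to 4; the second pass starts at element 4 and completes the load.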

NOTE: Vector instructions reset vstart to zero when they complete successfully.

Fault-only-first loads: vstart and vl

https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#77-unit-stride-fault-only-first-loads

Fault-only-first (FOF) loads have two behaviors when it comes to fault handling:

if an exception occurs on the first element of the vector (index 0), vl is left unchanged and a trap is taken
if an exception occurs on any other element, the trap is not taken and vl is reduced to the index of the faulting element

NOTE: if an exception occurs after the first element on a FOF load, then vl indicates how many elements were successfully loaded without triggering exceptions.
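The two behaviors can be captured in a few lines of Python. This is a simplified sketch under stated assumptions: byte-wide elements, a set of mapped addresses standing in for the memory protection mechanism, and a hypothetical `Trap` exception and `fof_load` helper.

```python
class Trap(Exception):
    """Hypothetical stand-in for a trap actually taken by the hart."""


def fof_load(mem, mapped, base, vl):
    """Return (loaded elements, new vl) for a byte-wide fault-only-first load."""
    vd = []
    for i in range(vl):
        addr = base + i
        if addr not in mapped:
            if i == 0:
                raise Trap(addr)  # element 0: the trap is taken, vl unchanged
            return vd, i          # later element: no trap, vl trimmed to i
        vd.append(mem[addr])
    return vd, vl                 # no fault: vl is left as requested


memory = bytes(range(8))
mapped = {0, 1, 2}

# Requesting 8 elements when only 3 are accessible: no trap, vl becomes 3.
loaded, new_vl = fof_load(memory, mapped, 0, 8)

# Starting on an inaccessible byte: the fault is on element 0, so it traps.
try:
    fof_load(memory, mapped, 3, 4)
    trapped = False
except Trap:
    trapped = True
```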

The RVV 1.0 specification states that FOF vector loads can be used to vectorize loops with data-dependent exit conditions (while loops).
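A string length routine is a classic example of such a loop: the code cannot know how many bytes are safe to read before it has found the terminator. The Python sketch below models a strip-mined loop built on a fault-only-first load. It is a scalar model rather than real RVV code; the function name, the `vlmax` parameter and the set of mapped addresses are assumptions made for the illustration.

```python
def strlen_fof(mem, mapped, start, vlmax=4):
    """Length of the NUL-terminated string at `start`, reading up to
    `vlmax` bytes per iteration with fault-only-first semantics."""
    length = 0
    while True:
        # Fault-only-first load: stop silently at the first inaccessible
        # byte, unless it is element 0, which is a genuine trap.
        chunk = []
        for i in range(vlmax):
            addr = start + length + i
            if addr not in mapped:
                if i == 0:
                    raise MemoryError(f"trap at address {addr}")
                break
            chunk.append(mem[addr])
        vl = len(chunk)  # vl as trimmed by the load
        # Compare against zero and find the first match within vl elements.
        if 0 in chunk:
            return length + chunk.index(0)
        length += vl     # no terminator yet: advance by what was loaded


# Only the 6 bytes of the string are accessible. The second iteration asks
# for 4 bytes but silently gets 2, which is enough to find the terminator.
text = bytearray(b"hello\x00")
hello_length = strlen_fof(text, set(range(6)), 0)
```

A plain unit-stride load in the second iteration would have trapped on the bytes past the terminator, even though the program never needed them.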

Conclusion

This post was the last of this series to present RISC-V Vector Extension. I hope you learned something and that RVV makes more sense now. Please leave a comment or question, and I will do my best to respond and improve the posts. Stay tuned for more content on RISC-V and its extensions.

Reference

RVV 1.0 indexed memory operation specification https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc#76-vector-indexed-instructions

