Note: this post may be too long to be displayed by a standard email client, you can check it in a web browser to get the full content (and happy new year!).
In a previous post, we presented our first attempt(s) at implementing a polynomial multiplication based on the Number Theoretic Transform algorithm, or NTT, using RISC-V Vector. Our fastest implementation of the multiplication of 128-coefficient polynomials, written with RVV C intrinsics, clocked in at around 17’820 cycles on the CanMV K230 board. In this post, we push the optimization further, starting by rewriting our best implementation in full assembly and then improving it (our initial goal was to get below 10’000 cycles).
You can find the original post at the link below (recommended reading if you need a refresher on what NTT is and an introduction to our C-based implementations):
Number Theoretic Transform with RVV
The polynomial multiplication using NTT consists of 4 steps:
Converting both input polynomials to the NTT domain: this requires the execution of two forward NTT transforms.
Element-wise multiplication of the two transformed inputs
Inverse NTT transform of the result back into the canonical domain
Scaling the result (element-wise) by the reciprocal of the polynomial degree
Steps 1 and 3 rely on the same function, only the twiddle factors change: the forward transform uses positive powers of the root of unity while the inverse transform uses the reciprocals of those powers. Those steps account for most of the cycles spent during the polynomial multiplication: more than 90% of the cycles are spent in the 3 calls to the NTT transforms. So our optimization effort will focus on the NTT transform (we will also fuse steps 2 and 4 to try to reduce the latency further).
Note: some cryptography algorithms based on polynomial multiplication do not require both inputs to be transformed, as one of them is already stored in the NTT domain. Here, we consider the more general (and more expensive) case where both inputs must be transformed each time.
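For reference, the whole 4-step flow could be sketched in C as follows. This is only an illustration: the helper names (ntt_forward, ntt_inverse) are hypothetical, the plain % operator stands in for the Barrett reduction discussed later, and q = 3329, n = 128 as in the rest of this post.
// Illustrative sketch of the 4-step multiplication (hypothetical helpers).
#define Q      3329
#define N      128
#define N_INV  3303   // 128^-1 mod 3329 (128 * 3303 = 422784 = 127 * 3329 + 1)

void ntt_forward(int dst[N], const int src[N]);   // hypothetical prototypes
void ntt_inverse(int dst[N], const int src[N]);

void poly_mult_ntt_sketch(int dst[N], const int lhs[N], const int rhs[N]) {
    int lhs_ntt[N], rhs_ntt[N];
    ntt_forward(lhs_ntt, lhs);                    // step 1: two forward NTTs
    ntt_forward(rhs_ntt, rhs);
    for (int i = 0; i < N; ++i)                   // step 2: element-wise product
        dst[i] = (int)(((long long)lhs_ntt[i] * rhs_ntt[i]) % Q);
    ntt_inverse(dst, dst);                        // step 3: inverse NTT
    for (int i = 0; i < N; ++i)                   // step 4: scale by n^-1 mod q
        dst[i] = (int)(((long long)dst[i] * N_INV) % Q);
}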
Optimizing a function in assembly
Our target, the CanMV K230 board, supports the rv64gcv_zba_zbb ISA. So we can use any instruction from those extensions (in particular the full RVV 1.0 v extension) but we cannot rely on more recent extensions (such as Zvbb for example).
When and why write assembly ?
Writing assembly is not really the lost ancient art of programming it can appear to be to the neophyte. It is true that the continuous improvement of optimizing compilers has made writing assembly useless and error-prone in most cases: bare assembly is not typed, the code can be very brittle, hard to debug and hard to (re)factor and extend. There remain a few cases where writing small application kernels in assembly can unearth extra performance (and if you have time to spare, it can also be quite fun and a useful way to learn an architecture and play with the quirks of a microarchitecture).
In our case, writing assembly is driven by multiple considerations: writing an implementation tailored to a particular target (the CanMV K230 single board computer) and trying to push the latency down as much as possible while controlling the program execution as tightly as possible. Our implementation is not intended to become production code and we intend to take some freedom with coding standards.
To avoid starting from scratch, we generated an initial version of the assembly implementation by copying the assembly code generated by the compiler from our C implementation. With clang, this can be done while building the object file by adding the command line option --save-temps; this dumps the various intermediary states of the compilation process into files, including the pre-processed source code (*.i file), the byte-code (*.bc file), and what interests us: the assembly file (*.s file).
Note: to generate only the assembly file (running pre-processing and compilation, and stopping before assembly and linking) you can simply use the -S command line compiler option.
Function prototype
Let us start by looking at the function prototype. We inherited the function prototype from our original RVV-based implementations of the NTT:
void rvv_ntt_transform_asm_internal(ntt_t* dst,
int* coeffs,
int _n,
int level,
int rootPowers[8][64]);
Let’s review how to unpack its arguments in our assembly code.
How to access function arguments ? (ABI)
ABI, or Application Binary Interface, is the “API” used by binary programs and functions. Among other things it specifies the binary calling convention: how function parameters are passed when a function is called, how return values need to be laid out, which registers are saved by the caller, and which are saved by the callee. These last two pieces of information determine which registers our assembly function can use without saving them first, and which would need to be saved (and eventually restored). This is key because it drives how many temporary variables can be live at the same time without saving any extra registers. Saving registers is possible, by pushing them onto the stack for example, but it implies adapting the function frame and eventually restoring the register values upon return.
In our particular case, with the prototype listed above, here is the register mapping for the function parameters:
// a0: address of destination buffer (of type ntt_t*)
// a1: input buffer
// a2: number of elements (should be 128)
// a3: level (should be 6)
// a4: address for root powers 2D array
Our function does not have any return value so we simply need to restore the callee-saved registers and return at the function exit.
Accessing the destination array
The first parameter of our function is the destination buffer. This is a C structure with multiple fields. In our function, we are only interested in the int* coeffs field, since this will be the start address of our destination array.
typedef struct {
int degree; // degree of the polynomial currently stored
int modulo;
int* coeffs;
size_t coeffSize; // size of coeffs (number of elements)
} polynomial_t;
Our code assumes a particular layout of the structure (defined by the version of the compiler we used to build the code). In our layout the field coeffs is 8-byte wide (a 64-bit address) and starts at the structure's 9th byte, i.e. at byte offset 8 (the first two int fields, degree and modulo, each occupy 4 bytes).
The corresponding RV64 assembly to load the coeffs address, when register a0 contains the address dst of our function argument, is simply:
// load into register t0 a 64-bit value from address a0 + 8
ld t0, 8(a0)
This performs the indirection between the address of the structure (contained in a0) and the address stored in the field coeffs.
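Since this 8-byte offset is an assumption about how our compiler lays out polynomial_t, it is worth pinning it down with a compile-time check on the C side; here is a minimal sketch (C11):
#include <stddef.h>   // offsetof

// Guard the hand-coded offset used by `ld t0, 8(a0)`: two 4-byte int fields
// (degree, modulo) followed by the 8-byte aligned coeffs pointer.
_Static_assert(offsetof(polynomial_t, coeffs) == 8,
               "assembly assumes coeffs at byte offset 8 in polynomial_t");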
Optimizing the function prototype
In this post, we implement a 7-level NTT working on a 128-element input and producing a 128-element output. So we can get rid of the arguments we do not use and streamline the prototype by encoding the destination directly as a pointer to an array of int:
void rvv_ntt_transform_asm_internal(int* dst,
int* coeffs,
int rootPowers[8][64]);
Those changes simplify the argument layout:
// a0: destination buffer
// a1: input buffer
// a2: address for root powers 2D array
We no longer need to unpack the destination buffer from a structure: the buffer's address is passed directly in an argument register.
Leveraging vector to optimize NTT
Let’s now look at optimizing each step of the NTT transform implementation.
A high-level view of the structure of our 128-element NTT algorithm is presented below. We are not going to deviate widely from this structure; rather we will try to optimize its implementation (e.g. fusing loops handling multiple steps, specializing code paths, sharing constants, …).
Permutation loop
Our previous experiments have hinted that the use of (unordered) indexed vector loads was a fast way to perform the first step of permuting coefficients in our NTT.
As before, we rely on a pre-computed table containing the coefficient indices (in fact this is exactly the same table that was used in our C implementations). As we are building a position independent code (PIC) binary, we must materialize the table address with some boilerplate code; its execution leaves the address of the table ntt_coeff_indices_128 in register a5:
.Lpcrel_hi3:
auipc a3, %got_pcrel_hi(ntt_coeff_indices_128)
ld a5, %pcrel_lo(.Lpcrel_hi3)(a3)
We can then write the first loop to permute the full vector of input coefficients (assuming the register a3
initially contains the avl
, application vector length, value):
// a1: address of the coefficient buffer
// a3: avl; initialized to 128
// a5: current address of the index buffer
// t0: current address of the destination buffer (initialized at
// the start of the buffer)
.loop_permute:
// setting vtype (32-bit elements, LMUL=8, avl=a3)
// getting back vl in a7
vsetvli a7, a3, e32, m8, ta, ma
// loading a new set of vector indices
vle32.v v8, (a5)
// load gather from the input coefficient array (a1) using
// the set of indices loaded previously
vluxei32.v v8, (a1), v8
// storing the result in the destination buffer
vse32.v v8, (t0)
// avl = avl - vl
sub a3, a3, a7
// address_update = vl * 4 (where sizeof(int) = 4)
slli a7, a7, 2
// updating input address
add a5, a5, a7
// updating destination address
add t0, t0, a7
// jumping back if avl has not reached zero yet
bnez a3, .loop_permute
The loop is rather straightforward: it starts by computing the local vector length (for 32-bit elements and LMUL=8) from the avl (a3) and returns vl in register a7. It then uses a unit-strided vector load, vle32.v, to load a new set of permutation indices from the array pointed to by register a5. It computes the next chunk of permuted coefficients by using the set of permutation indices to address the input buffer (whose address is stored in register a1) and stores the result at the current address of the destination buffer pointed to by register t0. Finally, the address increment is computed from the vector length and all the buffer addresses are updated before jumping back to the start of the loop body if avl has not reached 0 (this assumes that during the final loop iteration a3/avl will reach exactly 0).
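For completeness, here is a sketch of how such an index table could be precomputed on the C side. Two assumptions are labelled in the code: that the permutation is the usual bit-reversal of the 7-bit coefficient indices, and that the table stores byte offsets (vluxei32.v interprets its index operand as byte offsets from the base address).
#include <stdint.h>

// Sketch (assumptions): the permutation is the standard 7-bit bit-reversal,
// and each entry is pre-scaled to a byte offset so it can be fed directly
// to vluxei32.v without an extra shift in the loop.
uint32_t ntt_coeff_indices_128[128];

void init_ntt_coeff_indices(void) {
    for (uint32_t i = 0; i < 128; ++i) {
        uint32_t rev = 0;
        for (int b = 0; b < 7; ++b)                 // reverse the 7 index bits
            rev |= ((i >> b) & 1u) << (6 - b);
        ntt_coeff_indices_128[i] = rev * sizeof(uint32_t);  // byte offset
    }
}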
Unrolling the first levels of butterflies
Once the coefficient permutation is done we can start performing the multiple levels of butterflies. In our description, we execute 7 levels for the full NTT. We start at level 6, which works with pairs of 1-element groups, then level 5, which works on pairs of 2-element groups, and so on until the last level (level 0), which works with a single pair of 64-element groups.
In our fast C implementation we fused levels 6, 5 and 4 together (source code); this allowed us to load a single group of coefficients, execute 3 levels of butterflies on it and then store it back to memory before switching to the next group of coefficients.
To optimize the transform we go a few steps further: we fuse those 3 levels (6, 5 and 4) into the initial coefficient permutation loop and we look at fusing lower levels into that loop as well.
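As a refresher on what each butterfly level actually computes (and on what we are fusing), here is a simplified scalar C sketch of one level, operating on already-permuted coefficients. It mirrors the multiply/add/subtract pattern of the assembly shown further down, but it glosses over exactly where the modulo reduction is applied and how negative intermediate values are handled; half, twiddles and barrett_reduce are illustrative names.
// Simplified scalar view of one butterfly level (not the RVV implementation).
// 'half' is the group size for the level: 1 for level 6, 2 for level 5, ...,
// 64 for level 0. 'twiddles' points to the root powers used by this level.
int barrett_reduce(int x);   // any reduction modulo q = 3329 (illustrative)

static void ntt_level_sketch(int coeffs[128], int half, const int *twiddles) {
    for (int base = 0; base < 128; base += 2 * half) {
        for (int i = 0; i < half; ++i) {
            // assuming coefficients are already reduced below q
            int even = coeffs[base + i];
            int odd  = barrett_reduce(coeffs[base + half + i] * twiddles[i]);
            coeffs[base + i]        = even + odd;   // "even" output
            coeffs[base + half + i] = even - odd;   // "odd" output
        }
    }
}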
Unrolling more levels into the first-levels loops
Our approach has been to try to maximize the use of the largest possible vector register group (LMUL=8). From level 6 to level 2, a full butterfly input, output or intermediary result can fit in such a group. Thus we have implemented those 5 levels together, right after the initial coefficient permutation. There is no intermediate storing or loading of coefficients during this process: every intermediary result is kept in a vector register group.
Hoisting constants
Our algorithm requires a few scalar constants:
the modulo q (3329)
the Barrett’s reduction factor
various buffer address increments
and a few vector constants:
masks for pair swaps
twiddle factors to assemble NTT levels
If those constants can be allocated to registers that do not need to be reused until all the loop iterations have completed, then their definition can be hoisted outside the innermost loops and they can be reused across iterations without requiring any new instruction (no need to load them back from memory). The scalar constants are rather small, while the vector ones (in particular some of the twiddle factors) can be large arrays, so it is critical to allocate them properly and try to maximize register file usage while leaving enough room to manipulate intermediary results.
Here is an example of the final vector constant allocation for the loop performing level 6 through 2:
// building mask and twiddle factors
// static allocation of vector registers (invariant in loop)
// v1: mask for level 6
// v2: mask for level 5
// v3: twiddle factors for level 4
// v4v5v6v7: twiddle factors for level 2
// v24v25v26v27v28v29v30v31: twiddle factors for level 5
A total of 15 registers are used to store vector constants. Since v0 is the only register usable as a mask operand, and we have more than one type of mask operand in that loop, it cannot be used to store a constant. Furthermore we need to keep at least two vector register groups (v8 and v16 with LMUL=8) free to store operation operands and results.
Use of Vector Length Specific Optimizations
There are a few more optimizations that can be made if we stop limiting ourselves to vector length agnostic (VLA) code and exploit improvements for a specific VLEN value. In our case, we are primarily optimizing for the CanMV K230 (the only RISC-V board with RVV 1.0 that is accessible to us at the moment), which implements a VLEN=128-bit vector unit.
What kind of optimization can we add if we assume our program only works for VLEN=128 ?
if we need to swap 4-element groups (with 32-bit elements), or more generally any pair of n-element groups where n * SEW is an integer multiple of VLEN, we can use static register indexing rather than actual instructions
For example, for level 3 of our NTT, the butterfly operates on 8-element groups. The code can leverage LMUL=2 to perform the swap with static vector register group indexing: v8v9 and v12v13 contain even coefficients, while v10v11 and v14v15 contain odd coefficients.
// level 3 butterfly
// a single 8-element group of twiddle factor
// is required (assumed to be present in v16)
vsetvli a6, x0, e32, m2, ta, ma
vmul.vv v18, v10, v16 // multiply odd group v10v11 by twiddles
vmul.vv v22, v14, v16 // multiply odd group v14v15 by twiddles
vsub.vv v10, v8, v18 // building odd result v10v11
vadd.vv v8, v8, v18
vsub.vv v14, v12, v22
vadd.vv v12, v12, v22
Barrett’s reduction with LMUL=8
Our initial implementation of Barrett’s reduction relies on a widening operation, which is required because we multiply the Barrett constant (an approximation of 2^24 / q) by the ~24-bit result of an integer multiplication, producing a value wider than 32 bits. This widening is then followed by a narrowing shift (extracting the upper bits), a multiply-subtract to get the remainder, and a correction step, because the initial remainder might be greater than or equal to q (but is always less than 2*q). For LMUL=4, this can be implemented with the following snippet (assuming the input is in the v8 vector register group):
vsetvli a6, x0, e32, m4, ta, mu
vwmul.vx v24, v8, t4
vnsra.wi v24, v24, 24
vnmsac.vx v8, t1, v24
vmsge.vx v0, v8, t1
vsub.vx v8, v8, t1, v0.t
Note: because we are using a masked subtraction to correct only the elements that are greater than or equal to q, we must use the mask undisturbed policy, mu.
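To make the need for widening explicit, here is a scalar C model of this variant (assuming the Barrett factor is floor(2^24/q) = 5039): the product x * m does not fit in 32 bits, which is exactly what the vwmul.vx / vnsra.wi pair reproduces element-wise.
#include <stdint.h>

// Scalar model of the widening Barrett reduction (n = 24); the 64-bit
// intermediate is what forces vwmul.vx + vnsra.wi in the vector code.
static uint32_t barrett_reduce_n24(uint32_t x) {
    const uint32_t q = 3329;
    const uint32_t m = (1u << 24) / q;            // assumed factor: 5039
    uint64_t wide = (uint64_t)x * m;              // wider than 32 bits
    uint32_t r = x - (uint32_t)(wide >> 24) * q;  // vnsra.wi + vnmsac.vx
    if (r >= q) r -= q;                           // vmsge.vx + masked vsub.vx
    return r;
}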
The use of the widening multiplication limits our maximum LMUL to 4. As we wish to experiment with the widest possible LMUL, we must come up with a workaround. One that works is to perform two LMUL=4 sequences, operating on the lower and upper halves of the LMUL=8 vector register group. We can also combine the parts that come after the widening/narrowing sequence and use LMUL=8 for them:
vsetvli a6, x0, e32, m4, ta, mu
vwmul.vx v24, v8, t4
vnsra.wi v24, v24, 24
vnmsac.vx v8, t1, v24
vwmul.vx v16, v12, t4
vnsra.wi v16, v16, 24
vnmsac.vx v12, t1, v16
vsetvli a6, x0, e32, m8, ta, mu
vmsge.vx v0, v8, t1
vsub.vx v8, v8, t1, v0.t
We can remove one more instruction by noting that a proper register allocation allows us to fuse the two LMUL=4 vnmsac.vx into a single LMUL=8 version.
vsetvli a6, x0, e32, m4, ta, mu
vwmul.vx v24, v8, t4
vnsra.wi v24, v24, 24
vwmul.vx v16, v12, t4
vnsra.wi v28, v16, 24 // was vnsra.wi v16, v16, 24
vsetvli a6, x0, e32, m8, ta, mu
vnmsac.vx v8, t1, v24 // fused 2xLMUL=4 => LMUL=8 vnmsac.vx
vmsge.vx v0, v8, t1
vsub.vx v8, v8, t1, v0.t
Barrett’s reduction without widening/narrowing
But in fact, we could try to be slightly smarter: do we really need explicit widening and narrowing instructions to perform Barrett’s modulo reduction?
The expensive part of our Barrett implementation comes from the fact that we need to sequence a widening multiplication followed by a narrowing shift to get the upper part of the result. This can be optimized if, rather than using n=24 and an approximation of 2^n / q, we use n=32 and 2^32 / q as our Barrett coefficient. Since n is now aligned on the element width, we can rely on vmulh.vx to directly compute the upper part of the multiplication, thus performing the widening multiply followed by the narrowing shift in a single operation!
The snippet to compute Barrett’s reduction becomes:
// v8 contains the input product a.b
// t4 contains 2^32/q = 1290167
// t1 contains the modulo value q=3329
// v16 <- floor(((a.b) * 2^32/q) / 2^32)
vmulh.vx v16, v8, t4
// v8 <- (a.b) - q * floor(((a.b) * 2^32/q) / 2^32)
vnmsac.vx v8, t1, v16
// mask active when element >= 3329 (but always < 2*3329)
vmsge.vx v0, v8, t1
// subtraction of 3329 from overflowing elements
vsub.vx v8, v8, t1, v0.t
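A scalar C model of this variant makes it easy to check the constant (1290167 = floor(2^32/3329)); the vmulh.vx above plays the role of the 64-bit multiply plus 32-bit shift, which is safe here since the operands stay far below 2^31, so signed and unsigned high multiplies agree.
#include <stdint.h>

// Scalar model of the mulh-based Barrett reduction (n = 32).
static uint32_t barrett_reduce_n32(uint32_t x) {
    const uint32_t q = 3329;
    const uint32_t m = 1290167;                           // floor(2^32 / q)
    uint32_t quot = (uint32_t)(((uint64_t)x * m) >> 32);  // vmulh.vx
    uint32_t r = x - quot * q;                            // vnmsac.vx
    if (r >= q) r -= q;                                   // vmsge.vx + masked vsub.vx
    return r;
}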
The speed-up provided by this optimization is very large (a reduction of about 3000 cycles) and is not really related to the assembly implementation (I should have thought about it sooner :-) ).
Bypassing Barrett’s reduction correction
Looking closer at our implementation of Barrett’s modulo reduction, you can notice that it consists of two phases:
subtracting the coarse evaluation of the quotient multiplied by the modulo
correcting the remainder if it equals or exceeds the modulo; this correction is at most a subtraction of the modulo value
While both steps are required to get a result between zero (included) and the modulo value (excluded), only the first step is really required to chain multiple operations.
Indeed, since we are multiplying result elements by twiddle factors, if we did not perform the reduction at each level we would risk overflowing our element size (leading to incorrect results); but the correction step is not required to avoid this.
We can remove every single correction step but the last (the one performed in level 0 for example, or even only the one performed in level 0 of the reverse transform) and still get a correct result. We introduce some redundancy in the representation of most elements and only eventually collapse this redundancy to get the canonical representation.
We eventually removed all the Barrett correction code paths but the last one, and made it conditional by adding an extra argument to the function to select whether or not this correction should be performed.
// void rvv_ntt_transform_asm_internal(
// int* dst,
// int* coeffs,
// int rootPowers[8][64],
// int finalCorrection);
//
// a0: destination buffer
// a1: input buffer
// a2: address for root powers 2D array
// a3: perform barrett's modulo reduction final correction (in level 0)
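A plausible C-side usage looks like the following sketch (the buffer and twiddle table names are hypothetical): the two forward transforms skip the correction and only the inverse transform collapses the redundant representation.
// Hypothetical calling code: skip the correction in the forward transforms and
// only collapse the redundant representation at the end of the inverse transform.
rvv_ntt_transform_asm_internal(lhs_ntt, lhs_coeffs, fwd_root_powers, /* finalCorrection */ 0);
rvv_ntt_transform_asm_internal(rhs_ntt, rhs_coeffs, fwd_root_powers, /* finalCorrection */ 0);
// ... element-wise multiplication of lhs_ntt and rhs_ntt into prod_ntt ...
rvv_ntt_transform_asm_internal(result, prod_ntt, inv_root_powers, /* finalCorrection */ 1);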
Scheduling
Instruction scheduling is another key axis for program speed optimization. Scheduling can be done statically (at programming time or at compile time) or dynamically (at just-in-time compilation time, or during execution). The latter, scheduling during execution, requires an advanced micro-architecture, in particular one with out-of-order execution capability.
I am going to focus on compile time scheduling, assuming the micro-architecture is going to execute the instructions mostly in the order I write them. This optimization is highly specific to the micro-architecture. We can apply a few principles: we must have a model of the target micro-architecture (what are its capabilities? can it chain / overlap vector operations? which ones? what is the latency of each operation? are they pipelined?). In a future post, I will show a micro-benchmarking campaign I am running on the K230 to evaluate its CPU, but for the time being I am mostly relying on trial and error.
One of the keys is trying not to work against the machine !
You can find one example of manual instruction scheduling here: poly_mult_rvv_asm.S#L186, where the loading of the twiddle factors for the level 3 butterfly is anticipated (using a vl2re32.v whole register load, which does not need a vsetvl*) a few instructions before it is actually needed.
Putting it all together
We can now assemble all our optimizations into a single implementation of the NTT transform (those optimizations were evaluated separately and together). The structure of the final version of our NTT transform is presented below:
The structure is slightly more complex than the 2-stage organization used in our original intrinsic-based C implementation (one fused loop for levels 6 through 4 and one 2-level generic loop nest for levels 3 through 0); but it is also much faster.
The full implementation is available in the rvv-examples repo: poly_mult_rvv_asm.S.
Benchmarking our work
Results: multiplication latency improvement
Each optimization step was evaluated by running 500 random test cases 10 times (so 5000 cases overall for each variant). The reported results are the average latency over those 5000 cases, and the minimum and maximum latency across the 10 runs of 500 tests. The standard deviation for each experiment oscillates between 55.0 and 127.4 cycles.
We have reported measurements at various steps of the optimization process.
We start on the left with a measurement of the latency of our intrinsic-based C implementation and, moving to the right, we add more and more optimizations.
We can note 3 key improvements:
fusing 5 levels (rather than 3) in the first loop
Switching to a 32-bit constant in the Barrett’s reduction implementation
Getting rid of extra Barrett’s correction to keep only the last one
We have represented 8 stages of optimization (about 30 stages were actually evaluated; we have selected the most interesting ones).
The best latency we obtain is around 6’650 cycles on average. We managed to get below the 10’000-cycle target!
Where are the cycles spent ?
We have benchmarked a version of the program where we enable the steps of the NTT transforms one by one to measure the delta latency induced by each one. The initial setup is as follows: we enable the code for the 3 NTT transforms (two forward and one backward) plus the element-wise multiplication and degree scaling; for each transform, only the initialization phase and the loop storing the result of the fused levels (levels 6 to 2) are enabled; the actual code to perform each of the levels is disabled, including the full loops performing level 1 and level 0. Then we enable the levels one by one, and perform a measurement over 10 runs of 100 tests each; the average results, divided by 3, are reported below.
Note: the standard deviation can be quite large for these measurements, so the accuracy should be considered very approximate.
The bulk of the cycles is spent in the first loop level (init + store + permute + levels 6 through 2), which accounts for a little over 73% of the total NTT transform latency. Level 1 accounts for a little over 12% of the total latency and level 0 for the remainder.
Further optimizations
We have certainly not explored everything that could be done. Among the things that could be considered:
Reducing the coefficient size: for simplicity we have been using 32-bit elements to store the NTT coefficients. Once reduced modulo q, those coefficients are all less than 3329, so 16 bits would be enough (or even 12 bits if such an element size existed). This would require most of our vector multiplications to widen to 32 bits (and another widening to 64 bits would be required for Barrett’s modulo reduction), implying a lot of SEW changes and making it impossible to use LMUL=8 everywhere; but it would still be worth investigating.
Fusing the element-wise multiplication and degree scaling steps with the inverse NTT transform. One of the keys to the performance uplift is keeping data in registers and avoiding exchanges with the memory subsystem. If we can fuse more of the large steps of the process and keep data in vector registers, that could bring further speed-ups (at least by saving the vector store followed by vector load round-trips).
Using a VLEN-specific approach is somewhat cheating: it contradicts RVV’s intent of providing programs that can be executed seamlessly across a wide variety of implementations, including ones with different VLEN values, and it reduces the program’s portability. This should be addressed by going back to a VLEN-agnostic program, which will require some changes in the fused levels that make use of VLEN knowledge to optimize butterflies, and also in the twiddle factor loading through whole register group loads.
Our code heavily relies on vsetvl* instructions setting vl to VLMAX when avl is greater than or equal to VLMAX. Although this seems to be true on our target, this is actually not mandated by RVV 1.0. This makes the code quite brittle and should be addressed.
Looking at the overall diagram, there are a few pairs of vector stores followed by vector loads which could be optimized away by switching the loop processing order between two levels: for example, level 0 could start from the most significant groups (odd elements whose indices are 96 to 127) to exploit the fact that the level 1 loop finished with this group; it would then no longer be required to store the results back during the last iteration of the level 1 loop and load them back during the first iteration of the level 0 loop, thus saving a round-trip to memory.
Conclusion
We have successfully lowered the latency of our polynomial multiplication implementation. Note that a lot of the speed-up was unrelated to writing the code in assembly: fusing more levels (levels 6 to 2, compared to levels 6 to 4 in our previous implementation) in the first set of loops provided a big speed-up, as did using the multiply-high instruction for Barrett’s modulo reduction.
Finally, we managed to get down to around 6’700 cycles! (or about 2’235 cycles per NTT transform).
Reference(s):
Source code:
Benchmark directory: rvv-examples/tree/main/src/polynomial_mult
Full assembly implementation: poly_mult_rvv_asm.S
Original post
> Our code heavily relies on vsetvl* instructions setting vl to VLMAX when avl is greater than or equal to VLMAX. Although this seems to be true on our target, this is actually not mandated by RVV 1.0. This makes this code quite brittle and should be addressed.
This is a single additional min instruction: vsetvli(min(avl, vlmax))
Or you can just always use VLMAX; which one works better depends on the context.
> Using a VLEN specific approach is somehow cheating
I think the best approach is to have a general solution, with specialization where beneficial. Branching on VLEN should be extremely cheap, because it's always predicted.