Testing RISC-V Vector instructions on the BPI-F3

Introduction

The RISC-V ISA includes an optional Vector Extension (RVV) designed to accelerate parallel workloads by operating on vector registers. The current version of the vector extension, RVV 1.0, standardizes features such as implementation-defined vector register lengths (VLEN), flexible element types, and dynamic vector-length control, enabling efficient vectorized computation. RVV is optional in RVA22 but mandatory in RVA23, which is expected to drive broader adoption of these features in future RISC-V hardware. This post explores and tests RVV 1.0, focusing on its potential for future hardware support and software packaging.

While RVV 1.0 and RVA22 show promise for high-performance applications, hardware that fully implements these features is currently scarce. The BPI-F3, powered by the SpacemiT® Key Stone™ K1 SoC, supports RVA22 and RVV 1.0. In this post we explore and test the vector capabilities of the BPI-F3 by comparing the performance of a couple of examples implemented both with RVV 1.0 instructions and with a traditional scalar approach.

To generate meaningful tests, it is important to understand the parameters that govern RVV 1.0 vector operations. Let’s begin with a brief overview of the key concepts behind RVV 1.0.

RISC-V Vectors

The extension defines 32 physical vector registers (v0 to v31), each VLEN bits long, where VLEN is a constant fixed by the hardware implementation.

Moreover, a single vector instruction can operate on more than one vector register at a time. The number of registers an instruction operates on is governed by the dynamic LMUL parameter, which scales the effective length of vector registers by grouping them. For example, with LMUL=4 each vector operand effectively spans 4 physical vector registers (an instruction naming v8 actually works on v8 through v11), reducing the number of independently usable register groups but enabling operations on wider data.

The size in bits of the elements that compose a vector is called the SEW (Selected Element Width), so the number of elements in a single vector register is VLEN/SEW.

Finally, the number of elements that a register group effectively spans (taking LMUL into account) is VLMAX = LMUL * VLEN / SEW. These quantities can be set (except VLEN, which is fixed) and queried using the RISC-V Vector Intrinsics library:

#include <stdio.h>
#include <riscv_vector.h>

// The functions __riscv_vsetvlmax_* return VLMAX for different values of
// LMUL and SEW. VLEN is a microarchitectural constant while VLMAX changes,
// but VLEN can always be recovered from VLMAX, SEW and LMUL:
// VLEN = VLMAX * SEW / LMUL.

int main(void) {
    // SEW=8, LMUL=1
    printf("VLEN in bits: %zu\n", __riscv_vsetvlmax_e8m1() * 8);
    // SEW=32, LMUL=1
    printf("VLEN in bits: %zu\n", __riscv_vsetvlmax_e32m1() * 32);
    // SEW=16, LMUL=4
    printf("VLEN in bits: %zu\n", __riscv_vsetvlmax_e16m4() * 16 / 4);
    // SEW=64, LMUL=2
    printf("VLEN in bits: %zu\n", __riscv_vsetvlmax_e64m2() * 64 / 2);

    return 0;
}
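Assuming a recent GCC or Clang with RVV intrinsics support, the snippet above can be built for the board with something like gcc -O2 -march=rv64gcv (the exact flags depend on the toolchain).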

For example, the BPI-F3 reports a VLEN of 256 bits.
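With VLEN = 256 this means, for instance, that a single register holds 4 doubles (SEW=64), and a register group with LMUL=8 holds VLMAX = 8 * 256 / 64 = 32 doubles.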

Testing

The following plot summarizes the performance of multiplying two vectors of doubles and summing up the entries of the result [1].

As shown, the performance on the BPI-F3 is heavily influenced by the value of LMUL. For LMUL values greater than 4 a clear speedup is observed, while for LMUL values less than 4 the overhead seems to outweigh the benefits: there, performance is actually worse than the scalar approach.
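To make the role of LMUL in this kernel concrete, here is a minimal sketch of an element-wise multiply followed by a sum reduction written with the RVV intrinsics, fixed at LMUL=8 for illustration; the benchmark varies LMUL by switching between the m1/m2/m4/m8 intrinsic variants, and this sketch follows the structure of the rvv_reduce.c example cited in the footnote rather than the exact benchmark code.

#include <riscv_vector.h>
#include <stddef.h>

// Multiply two arrays of doubles element-wise and sum the products.
// Sketch with SEW=64 and LMUL=8; the scalar baseline is simply
// sum += a[i] * b[i] in a loop.
double mul_reduce_rvv(const double *a, const double *b, size_t n) {
    // Accumulator kept in a single-register (m1) vector.
    vfloat64m1_t vsum = __riscv_vfmv_v_f_f64m1(0.0, 1);

    for (size_t i = 0; i < n;) {
        // vl = number of elements handled in this iteration.
        size_t vl = __riscv_vsetvl_e64m8(n - i);
        vfloat64m8_t va = __riscv_vle64_v_f64m8(&a[i], vl);
        vfloat64m8_t vb = __riscv_vle64_v_f64m8(&b[i], vl);
        vfloat64m8_t vp = __riscv_vfmul_vv_f64m8(va, vb, vl);
        // Unordered floating-point sum reduction into the accumulator.
        vsum = __riscv_vfredusum_vs_f64m8_f64m1(vp, vsum, vl);
        i += vl;
    }
    return __riscv_vfmv_f_s_f64m1_f64(vsum);
}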

This performance drop at lower LMUL values can likely be attributed to increased overhead, including memory transfer costs. A potential way to mitigate this issue could be through a test that minimizes memory transfers. For instance, multiplying a fixed 4x4 matrix by a large number of 4x1 vectors could reduce memory overhead and provide a clearer picture of the actual computational performance.

The next plot summarizes the performance of multiplying a fixed 4x4 matrix by many 4x1 vectors, offering a direct comparison under these conditions. The test is performed in two ways: one approach loads the matrix into four vectors and multiplies each of those vectors by the corresponding scalar entry of the 4x1 vector (LMUL=1); the other places the 4x4 matrix into a 16-element vector, creates a 16-element vector with the scalars, and performs the multiplication in one go (LMUL=4). A sketch of the first approach is shown below.
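The following sketch illustrates the LMUL=1 variant. It assumes a column-major layout for the matrix and that VLEN is at least 256 bits so a whole column of doubles fits in one register; the function and variable names are illustrative, not the exact benchmark code.

#include <riscv_vector.h>
#include <stddef.h>

// y = A * x for a fixed 4x4 matrix A (column-major) applied to n
// 4x1 vectors stored contiguously in x. LMUL=1 variant: one column
// of A per register, accumulation via vector-scalar fused multiply-add.
void matvec4_rvv(const double A[16], const double *x, double *y, size_t n) {
    size_t vl = __riscv_vsetvl_e64m1(4);  // 4 doubles per column (VLEN >= 256 assumed)
    vfloat64m1_t c0 = __riscv_vle64_v_f64m1(&A[0],  vl);
    vfloat64m1_t c1 = __riscv_vle64_v_f64m1(&A[4],  vl);
    vfloat64m1_t c2 = __riscv_vle64_v_f64m1(&A[8],  vl);
    vfloat64m1_t c3 = __riscv_vle64_v_f64m1(&A[12], vl);

    for (size_t i = 0; i < n; ++i) {
        const double *xi = &x[4 * i];
        // acc = c0*xi[0] + c1*xi[1] + c2*xi[2] + c3*xi[3]
        vfloat64m1_t acc = __riscv_vfmul_vf_f64m1(c0, xi[0], vl);
        acc = __riscv_vfmacc_vf_f64m1(acc, xi[1], c1, vl);
        acc = __riscv_vfmacc_vf_f64m1(acc, xi[2], c2, vl);
        acc = __riscv_vfmacc_vf_f64m1(acc, xi[3], c3, vl);
        __riscv_vse64_v_f64m1(&y[4 * i], acc, vl);
    }
}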

The results here do not depend on N (the number of 4x1 vectors), as expected when N is much larger than the vector length. However, the best performance is lower than in the previous test, even when LMUL=4 vectors are used.

1. A version of this test for LMUL=1 and a fixed vector length can be found at RISC-V Vector Intrinsics Example: rvv_reduce.c. The test used here is a modification of it for different values of LMUL.