Performance inconsistency in llama-cli and llama-bench with my RVV llamafile SGEMM (BPI-F3) #18104
rehan-10xengineer asked this question in Q&A
Hi everyone,
I am adding RISC-V Vector (RVV) support to the llamafile SGEMM implementation used in the prefill stage of llama.cpp, covering both FP16 and FP32 matmul. My board is a Banana Pi BPI-F3 (SpacemiT K1 8-core SoC) with VLEN=256.
The kernel is implemented and working, but with the FP16 model the performance fluctuates significantly between runs, in both llama-cli and llama-bench. The FP32 model does not show this behaviour.
The Issue
I ran the prefill test (32 tokens) 10 times in a row using 8 threads for both FP32 and FP16 versions of TinyLlama-1.1B.
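Each run is essentially a prefill-only llama-bench invocation, roughly of this shape (a sketch, not my exact command line; the model filename is a placeholder, and the FP32 model is run the same way with its own .gguf):

```sh
# Rough shape of the prefill test (model filename is a placeholder):
#   -p 32 : 32-token prompt (prefill only)   -n 0 : skip generation
#   -t 8  : 8 threads
for i in $(seq 10); do
    ./llama-bench -m tinyllama-1.1b-f16.gguf -p 32 -n 0 -t 8
done
```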
1. TinyLlama-1.1B (FP32 Model) - Stable
The performance is the same in every run.
2. TinyLlama-1.1B (FP16 Model) - Unstable
The performance randomly jumps between two states: sometimes it is fast (~13 t/s), and other times it is slow (~6 t/s).
Implementation Details
Here is how the SGEMM kernel is implemented:
The output matrix is computed in 2x2 tiles, and the work is split into jobs for the threads.
m_job is the number of these 2x2 tiles along a row that one job (one thread) computes.
Depending on the size of the M dimension, m_job takes the value 8, 4, or 2.
Chunk Size: The size of the chunk of the output matrix calculated in one job is:
(36 * 2 = 72) rows x (2 * m_job) cols
(where 2 * m_job results in 16, 8, or 4 columns).
This job size is fixed for a given matmul operation, because it depends only on the M dimension of the output matrix.
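To make the layout concrete, here is a simplified C++ sketch of the partitioning described above. It is not the actual kernel code: the names (pick_m_job, sgemm_drive, kernel_2x2_chunk), the thresholds inside pick_m_job, and the use of N for the row dimension are illustrative; only the chunk shape (72 rows x 2*m_job columns, handed out round-robin to the threads) matches the description.

```cpp
#include <cstdint>
#include <algorithm>

// Illustrative: choose how many 2x2 tiles along M one job handles (8, 4, or 2).
// The thresholds here are placeholders, not the real selection logic.
static int pick_m_job(int64_t M) {
    if (M >= 64) return 8;
    if (M >= 16) return 4;
    return 2;
}

// Walk all chunks of the output matrix and hand them out round-robin to threads.
// ith / nth follow the usual "thread index / thread count" convention.
void sgemm_drive(int64_t M, int64_t N, int ith, int nth) {
    const int     m_job      = pick_m_job(M);
    const int64_t chunk_rows = 36 * 2;       // 72 rows  = 36 tiles of 2 rows
    const int64_t chunk_cols = 2 * m_job;    // 16, 8, or 4 cols = m_job tiles of 2 cols
    const int64_t jobs_n     = (N + chunk_rows - 1) / chunk_rows;
    const int64_t jobs_m     = (M + chunk_cols - 1) / chunk_cols;
    const int64_t njobs      = jobs_n * jobs_m;

    for (int64_t job = ith; job < njobs; job += nth) {
        const int64_t row0 = (job / jobs_m) * chunk_rows;
        const int64_t col0 = (job % jobs_m) * chunk_cols;
        const int64_t rows = std::min(chunk_rows, N - row0);  // edge chunks may be smaller
        const int64_t cols = std::min(chunk_cols, M - col0);
        (void)rows; (void)cols;
        // kernel_2x2_chunk(row0, col0, rows, cols);  // placeholder for the RVV micro-kernel
    }
}
```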
Key Observations and Discussion
Standalone benchmarking of the kernel produces consistent results, but once it is integrated into the full llama.cpp codebase, the FP16 SGEMM path occasionally slows down. It is unclear whether this comes from the kernel itself or from interactions elsewhere in llama.cpp. Notably, the slow FP16 runs often coincide with spikes in cache misses.
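To quantify that correlation per run, one option is to wrap a single benchmark invocation in perf stat, roughly like this (the model path is a placeholder, and the set of events actually exposed by the SpacemiT K1 PMU may differ):

```sh
# Count cache traffic for one prefill run (model path is a placeholder).
perf stat -e cache-references,cache-misses,instructions,cycles \
    ./llama-bench -m tinyllama-1.1b-f16.gguf -p 32 -n 0 -t 8
```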