Performance inconsistency in llama-cli and llama-bench with my RVV llamafile SGEMM (BPI-F3) #18104
rehan-10xengineer asked this question in Q&A
Hi everyone,
I am adding RISC-V Vector (RVV) support to the llamafile SGEMM implementation used in the prefill stage of llama.cpp, covering both FP16 and FP32 matmul. My board is a Banana Pi BPI-F3 (SpacemiT K1 8-core SoC) with VLEN=256.
The kernel is implemented and working, but with the FP16 model the performance fluctuates significantly between runs, in both llama-cli and llama-bench. The FP32 model does not show this behaviour.
The Issue
I ran the prefill test (32 tokens) 10 times in a row using 8 threads for both FP32 and FP16 versions of TinyLlama-1.1B.
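Each run is essentially a prefill-only llama-bench invocation, roughly of this shape (a sketch, not my exact command line; the model filename is a placeholder, and the FP32 model is run the same way with its own .gguf):

```sh
# Rough shape of the prefill test (model filename is a placeholder):
#   -p 32 : 32-token prompt (prefill only)   -n 0 : skip generation
#   -t 8  : 8 threads
for i in $(seq 10); do
    ./llama-bench -m tinyllama-1.1b-f16.gguf -p 32 -n 0 -t 8
done
```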
1. TinyLlama-1.1B (FP32 Model) - Stable
The performance is the same in every run.
2. TinyLlama-1.1B (FP16 Model) - Unstable
The performance randomly jumps between two states: sometimes it is fast (~13 t/s), and other times it is slow (~6 t/s).
Implementation Details
Here is how the SGEMM kernel is implemented:
The output matrix is computed in 2x2 tiles, and the work is split into jobs for the threads.
m_job is the number of these 2x2 tiles along a row that one job (one thread) computes.
Depending on the size of the M dimension, m_job takes the value 8, 4, or 2.
Chunk Size: The size of the chunk of the output matrix calculated in one job is:
(36 * 2 = 72) rows x (2 * m_job) cols
(where 2 * m_job results in 16, 8, or 4 columns).
This job size is fixed for a given matmul operation, because it depends only on the M dimension of the output matrix.
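To make the layout concrete, here is a simplified C++ sketch of the partitioning described above. It is not the actual kernel code: the names (pick_m_job, sgemm_drive, kernel_2x2_chunk), the thresholds inside pick_m_job, and the use of N for the row dimension are illustrative; only the chunk shape (72 rows x 2*m_job columns, handed out round-robin to the threads) matches the description.

```cpp
#include <cstdint>
#include <algorithm>

// Illustrative: choose how many 2x2 tiles along M one job handles (8, 4, or 2).
// The thresholds here are placeholders, not the real selection logic.
static int pick_m_job(int64_t M) {
    if (M >= 64) return 8;
    if (M >= 16) return 4;
    return 2;
}

// Walk all chunks of the output matrix and hand them out round-robin to threads.
// ith / nth follow the usual "thread index / thread count" convention.
void sgemm_drive(int64_t M, int64_t N, int ith, int nth) {
    const int     m_job      = pick_m_job(M);
    const int64_t chunk_rows = 36 * 2;       // 72 rows  = 36 tiles of 2 rows
    const int64_t chunk_cols = 2 * m_job;    // 16, 8, or 4 cols = m_job tiles of 2 cols
    const int64_t jobs_n     = (N + chunk_rows - 1) / chunk_rows;
    const int64_t jobs_m     = (M + chunk_cols - 1) / chunk_cols;
    const int64_t njobs      = jobs_n * jobs_m;

    for (int64_t job = ith; job < njobs; job += nth) {
        const int64_t row0 = (job / jobs_m) * chunk_rows;
        const int64_t col0 = (job % jobs_m) * chunk_cols;
        const int64_t rows = std::min(chunk_rows, N - row0);  // edge chunks may be smaller
        const int64_t cols = std::min(chunk_cols, M - col0);
        (void)rows; (void)cols;
        // kernel_2x2_chunk(row0, col0, rows, cols);  // placeholder for the RVV micro-kernel
    }
}
```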
Key Observations and Discussion
Standalone benchmarking of the kernel produces consistent results, but once it is integrated into the full llama.cpp codebase, the FP16 SGEMM path occasionally slows down. It is unclear whether this comes from the kernel itself or from interactions elsewhere in llama.cpp. Notably, the slow FP16 runs often coincide with spikes in cache misses.
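To quantify that correlation per run, one option is to wrap a single benchmark invocation in perf stat, roughly like this (the model path is a placeholder, and the set of events actually exposed by the SpacemiT K1 PMU may differ):

```sh
# Count cache traffic for one prefill run (model path is a placeholder).
perf stat -e cache-references,cache-misses,instructions,cycles \
    ./llama-bench -m tinyllama-1.1b-f16.gguf -p 32 -n 0 -t 8
```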