Automation for GPU layers, tensor split, tensor overrides, and context size (with MoE optimizations) #18049
JohannesGaessler started this conversation in Show and tell
CPU + GPU hybrid inference has been a core feature of llama.cpp since early on and, I would argue, one of the major selling points vs. projects like ExLlama.
Until now, the way to control memory use was to manually set parameters like `--n-gpu-layers` and `--tensor-split` to fit memory use to the free VRAM. However, this is of course suboptimal in terms of usability.
Downstream projects like Ollama and KoboldCpp have implemented mechanisms for automating memory allocation but those rely on rough heuristics and tend to be inaccurate.
As a consequence, to avoid running out of memory in some cases the heuristics are rather conservative and leave potential performance on the table.
The problem becomes even harder when running models across multiple GPUs or when running MoE models where the dense tensors
should be prioritized over the sparse MoE tensors for optimal performance.
On the latest llama.cpp version, following #16653, I implemented code to automate memory allocation across GPUs.
It works by doing virtual test allocations and using those as feedback to iteratively reduce memory use until the model fits across all GPUs.
The metric for memory use is the same as in the "memory breakdown" that you may have seen in recent llama.cpp versions.
The implementation is generic and should work for any ggml backend as long as it supports CPU + GPU hybrid inference and the memory breakdown is correct.
If you encounter problems using this new functionality, please open an issue instead of commenting here as this will make the process easier from my side.
The code starts by first checking whether the model is projected to fit as-is.
If yes, no changes are made.
If not, it first reduces the context size to free up memory.
If that is still not enough it starts moving tensors from VRAM to RAM.
Dense tensors are prioritized for better MoE performance.
Ideally one would only assign whole layers to GPUs for simplicity.
However, as individual layers can be very large relative to "small" GPUs with only 24 GiB of VRAM, this would result in significant waste.
For this reason, layers can "overflow", meaning that parts of them are moved to the next GPU in line or to system RAM.
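For comparison, this prioritization corresponds to what users of MoE models have so far done by hand with `--override-tensor`, i.e. pinning the sparse expert tensors to system RAM while offloading the dense tensors. A minimal sketch of that manual approach (binary name, model path, and regex are illustrative):

```sh
# Manual MoE offloading pattern: keep the sparse expert tensors (ffn_*_exps)
# in system RAM while offloading all dense tensors to the GPU.
# Model path and regex are illustrative.
./llama-server -m models/some-moe-model.gguf -ngl 99 -ot "ffn_.*_exps=CPU"
```

With the new fitting logic, such overrides are derived automatically when memory runs short.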
Command-Line Interface
The fitting of runtime parameters can be controlled as follows:
- `--fit`, `-fit`: set to `on` by default, can be set to `off` to disable parameter fitting.
- `--fit-target`, `-fitt`: target amount of free memory to leave on each GPU. As of right now this is the same value for all GPUs and it is not possible to specify e.g. an amount that should be used regardless of free memory.
- `--fit-ctx`, `-fitc`: minimum context size that can be set automatically. If `--ctx-size` is explicitly set by the user it is not changed.

If arguments that affect memory allocation such as `--n-gpu-layers`, `--tensor-split`, or `--override-tensor` are set by the user, there is no change to that memory allocation. There is no support for automatic modification of only one of these arguments; they are either wholly under user control or wholly under program control.
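For example, these flags could be combined as follows (binary name, model path, and the specific values are illustrative):

```sh
# Let the parameter fitting choose -ngl/-ts/-ot and the context size,
# leaving roughly 2048 MiB of VRAM free on each GPU and not shrinking
# the context below 16384 tokens. Paths and values are illustrative.
./llama-server -m models/gpt-oss-120b.gguf -fit on -fitt 2048 -fitc 16384
```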
There is a new tool `llama-fit-params` that can be used to retrieve the parameters that would be set by the new parameter fitting logic. For example:
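(An illustrative invocation; the model path is a placeholder and the flag handling of the tool is assumed to mirror the main CLI.)

```sh
# Print the values for --n-gpu-layers, --tensor-split, --override-tensor and
# the context size that the fitting logic would choose for this model.
# Model path is a placeholder; --fit-target support is assumed.
./llama-fit-params -m models/gpt-oss-120b.gguf --fit-target 1024
```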
Benchmark
The benchmark was done on a system with an AMD EPYC 7742 CPU and 8 3200 "MHz" DIMMs. As of right now `llama-bench` does not have support for `-fit`, `-fitt`, and `-fitc`. For this reason, the following workaround was used to feed the results from `llama-fit-params` into `llama-bench`:
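(A sketch of the general idea only, assuming `llama-bench` accepts the corresponding `-ngl`, `-ts`, and `-ot` flags; the exact commands and the output format of `llama-fit-params` are assumptions, and the placeholders must be filled in by hand.)

```sh
# Illustrative workaround: first retrieve the fitted parameters ...
./llama-fit-params -m models/gpt-oss-120b.gguf
# ... then forward the reported values to llama-bench manually:
./llama-bench -m models/gpt-oss-120b.gguf \
    -ngl <n_gpu_layers> -ts <tensor_split> -ot "<tensor_overrides>"
```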
The VRAM utilization is at ~85-90%.
As the default `--fit-target` is 1024 MiB, that would ideally leave ~4% of free VRAM on each GPU (1024 MiB out of the 24 GiB of an RTX 4090). However, since individual tensors can be several GB in size, some amount of waste is inevitable.
The time to fit the parameters increases roughly linearly with the number of GPUs.
Under ideal circumstances such as when running GPT OSS 120b on 4x RTX 4090 the code only needs to check that the VRAM is sufficient.
For Qwen 3 Next there currently seems to be a bug where the memory needed for the context is not accounted for correctly, so a full fit is done.
The time to fit is still fairly unoptimized.
Performance mostly increases as VRAM use increases, except when going from a single GPU to two GPUs
(while still being bottlenecked by RAM) or when the model could already be fit on fewer GPUs.
With better multi-GPU code the performance should increase monotonically as more GPUs are added.