Support Fused MoE & Qwen3 GGUF MoE models #3221
Conversation
@ivarflakstad Could you help review this PR? I also made a few additional changes, including exposing the device pointer of qtensor and updating the kernel build to support generating both the PTX and the library file at the same time (with a custom bindgen_cuda crate). I added a link to vllm.rs in the README as well (in the section for Candle-based projects) — hope that's okay!
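As a rough illustration of the build change mentioned above, here is a build.rs sketch assuming the stock bindgen_cuda API (the PR uses a custom fork, so the exact calls may differ):

```rust
// Sketch only: emit PTX for runtime loading and a static library for linking
// from the same CUDA kernel sources, using bindgen_cuda's Builder.
fn main() {
    println!("cargo:rerun-if-changed=src/");

    // Compile the kernels to PTX and write the generated Rust bindings.
    let ptx = bindgen_cuda::Builder::default().build_ptx().unwrap();
    ptx.write("src/lib.rs").unwrap();

    // Compile the same sources into a linkable static library.
    let out_dir = std::env::var("OUT_DIR").unwrap();
    bindgen_cuda::Builder::default().build_lib(format!("{out_dir}/libkernels.a"));
    println!("cargo:rustc-link-search={out_dir}");
    println!("cargo:rustc-link-lib=kernels");
}
```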
ivarflakstad left a comment:
Thanks for this, looks great!
I'm only able to give a preliminary review right now. I'll go deeper as soon as I have the time 👍
Thanks @ivarflakstad for the timely review — I've fixed the typos.
ivarflakstad left a comment:
Amazing work! 🚀
I haven't found any issues when testing, so I think it's best to merge now and address potential issues if they are discovered.
- [`candle-einops`](https://github.com/tomsanbear/candle-einops): A pure rust implementation of the python [einops](https://github.com/arogozhnikov/einops) library.
- [`atoma-infer`](https://github.com/atoma-network/atoma-infer): A Rust library for fast inference at scale, leveraging FlashAttention2 for efficient attention computation, PagedAttention for efficient KV-cache memory management, and multi-GPU support. It is OpenAI api compatible.
- [`llms-from-scratch-rs`](https://github.com/nerdai/llms-from-scratch-rs): A comprehensive Rust translation of the code from Sebastian Raschka's Build an LLM from Scratch book.
- [`vllm.rs`](https://github.com/guoqingbao/vllm.rs): A minimalist vLLM implementation in Rust based on Candle.
👌
```rust
// For long-context (32K+), need to use custom sort kernel
// #[cfg(feature = "cuda")]
// {
//     use attention_rs::sort::ArgSortOp;
//     topk_ids.flatten_all()?.sort(true)?
// }
// #[cfg(not(feature = "cuda"))]
```
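The non-CUDA branch is cut off in the excerpt above; as a rough sketch of what such a fallback could look like (an assumption, not the PR's actual code), candle's built-in sort can stand in for the custom kernel:

```rust
// Hedged sketch of a non-CUDA fallback path, assuming candle-core's built-in
// sort_last_dim; the PR's actual fallback branch may differ.
use candle_core::{Result, Tensor};

fn sort_topk_ids_fallback(topk_ids: &Tensor) -> Result<(Tensor, Tensor)> {
    // Flatten the [n_tokens, top_k] ids and sort ascending, returning both
    // the sorted expert ids and the argsort indices used to gather tokens.
    topk_ids.flatten_all()?.sort_last_dim(true)
}
```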
I'm open to improving the current sort kernel as well 👍
Thanks for the review, will submit another PR for this.
This PR introduces support for the Fused MoE kernel for both unquantized and quantized models.
Qwen3 MoE GGUF models are now fully supported by leveraging the dedicated MoE kernel developed for the Candle ecosystem.
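For context, here is a minimal, illustrative sketch of the per-expert loop that a fused MoE kernel collapses into a single launch. The function name and shapes are assumptions for illustration only, and the routing (softmax) weights are omitted for brevity; this is not the PR's implementation.

```rust
// Illustrative only: a naive MoE forward that dispatches each token to its
// top-k experts with separate matmuls. A fused MoE kernel performs this
// routing + expert GEMM in one CUDA launch instead of a Rust-side loop.
use candle_core::{Result, Tensor};

fn naive_moe_forward(
    hidden: &Tensor,        // [n_tokens, hidden_dim]
    router_logits: &Tensor, // [n_tokens, n_experts]
    expert_w: &[Tensor],    // one [hidden_dim, hidden_dim] projection per expert (simplified)
    top_k: usize,
) -> Result<Tensor> {
    // Route each token: expert indices sorted by descending router logit.
    let order = router_logits.arg_sort_last_dim(false)?; // [n_tokens, n_experts]
    let topk_ids = order.narrow(1, 0, top_k)?.contiguous()?.to_vec2::<u32>()?;
    let mut out = hidden.zeros_like()?;
    for e in 0..expert_w.len() {
        // Gather the tokens assigned to expert `e`.
        let rows: Vec<u32> = topk_ids
            .iter()
            .enumerate()
            .filter(|(_, ids)| ids.contains(&(e as u32)))
            .map(|(t, _)| t as u32)
            .collect();
        if rows.is_empty() {
            continue;
        }
        let idx = Tensor::new(rows.as_slice(), hidden.device())?;
        let x = hidden.index_select(&idx, 0)?; // [n_e, hidden_dim]
        let y = x.matmul(&expert_w[e])?;       // per-expert GEMM
        // Scatter the expert output back (routing weights omitted here).
        out = out.index_add(&idx, &y, 0)?;
    }
    Ok(out)
}
```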
🔧 Usage Examples
Local GGUF File
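A hedged sketch of the local-file invocation, assuming the example accepts a GGUF path via a `--model` flag as other candle quantized examples do (the flag name and path are placeholders):

```bash
# Assumption: --model points the example at a local GGUF file; adjust to the
# example's actual CLI if it differs.
cargo run --features cuda --example quantized-qwen3-moe --release -- \
  --model /path/to/qwen3-moe-q4_k_m.gguf \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```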
Load from Hugging Face
```bash
cargo run --features cuda --example quantized-qwen3-moe --release -- --which 32b_q4k --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```

Available presets via the `--which` argument: `16b_q2k`, `16b_q4k`, `16b_q6k`, `16b_q80`, `32b_q2k`, `32b_q4k`, `32b_q6k`, `32b_q80`

Unquantized Model (Fused MoE Kernel)
Run the unquantized Qwen3-32B-A3B model using the fused MoE kernel (⚠ requires ~80GB GPU memory), either from local weights or by loading them remotely:
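As a rough sketch of both invocations (the example name `qwen3-moe` and its flags are hypothetical placeholders, not the PR's confirmed CLI):

```bash
# Hypothetical example name and flags, shown only to illustrate the two modes.
# From local weights:
cargo run --features cuda --example qwen3-moe --release -- \
  --weight-path /path/to/qwen3-moe-weights \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"

# Or loading the weights remotely from the Hugging Face Hub:
cargo run --features cuda --example qwen3-moe --release -- \
  --model-id <hf-repo-id> \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```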
📝 Testing Status
Full inference on the unquantized Qwen3-32B-A3B model has not been completed here due to GPU memory limitations (only the first 20 layers were tested).
However, the added code path has already been verified in other projects (under a multi-rank configuration), including candle-vllm and vllm.rs, where it functions correctly.
To run full inference on unquantized 32B+ models, a multi-rank / multi-GPU example will likely be required. 🔜