Support Fused MoE & Qwen3 GGUF MoE models #3221
Conversation
@ivarflakstad Could you help review this PR? I also made a few additional changes, including exposing the device pointer of qtensor and updating the kernel build to support generating both the PTX and the library file at the same time (with a custom bindgen_cuda crate). I added a link to vllm.rs in the README as well (in the section for Candle-based projects) — hope that's okay!
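As a rough illustration of the build change mentioned above, here is a build.rs sketch assuming the stock bindgen_cuda API (the PR uses a custom fork, so the exact calls may differ):

```rust
// Sketch only: emit PTX for runtime loading and a static library for linking
// from the same CUDA kernel sources, using bindgen_cuda's Builder.
fn main() {
    println!("cargo:rerun-if-changed=src/");

    // Compile the kernels to PTX and write the generated Rust bindings.
    let ptx = bindgen_cuda::Builder::default().build_ptx().unwrap();
    ptx.write("src/lib.rs").unwrap();

    // Compile the same sources into a linkable static library.
    let out_dir = std::env::var("OUT_DIR").unwrap();
    bindgen_cuda::Builder::default().build_lib(format!("{out_dir}/libkernels.a"));
    println!("cargo:rustc-link-search={out_dir}");
    println!("cargo:rustc-link-lib=kernels");
}
```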
ivarflakstad left a comment:
Thanks for this, looks great!
I'm only able to give a preliminary review right now. I'll go deeper as soon as I have the time 👍
Thanks @ivarflakstad for the timely review — I've fixed the typos.
ivarflakstad left a comment:
Amazing work! 🚀
I haven't found any issues when testing, so I think it's best to merge now and address potential issues if they are discovered.
- [`candle-einops`](https://github.com/tomsanbear/candle-einops): A pure rust implementation of the python [einops](https://github.com/arogozhnikov/einops) library.
- [`atoma-infer`](https://github.com/atoma-network/atoma-infer): A Rust library for fast inference at scale, leveraging FlashAttention2 for efficient attention computation, PagedAttention for efficient KV-cache memory management, and multi-GPU support. It is OpenAI api compatible.
- [`llms-from-scratch-rs`](https://github.com/nerdai/llms-from-scratch-rs): A comprehensive Rust translation of the code from Sebastian Raschka's Build an LLM from Scratch book.
- [`vllm.rs`](https://github.com/guoqingbao/vllm.rs): A minimalist vLLM implementation in Rust based on Candle.
👌
```rust
// For long-context (32K+), need to use custom sort kernel
// #[cfg(feature = "cuda")]
// {
//     use attention_rs::sort::ArgSortOp;
//     topk_ids.flatten_all()?.sort(true)?
// }
// #[cfg(not(feature = "cuda"))]
```
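The non-CUDA branch is cut off in the excerpt above; as a rough sketch of what such a fallback could look like (an assumption, not the PR's actual code), candle's built-in sort can stand in for the custom kernel:

```rust
// Hedged sketch of a non-CUDA fallback path, assuming candle-core's built-in
// sort_last_dim; the PR's actual fallback branch may differ.
use candle_core::{Result, Tensor};

fn sort_topk_ids_fallback(topk_ids: &Tensor) -> Result<(Tensor, Tensor)> {
    // Flatten the [n_tokens, top_k] ids and sort ascending, returning both
    // the sorted expert ids and the argsort indices used to gather tokens.
    topk_ids.flatten_all()?.sort_last_dim(true)
}
```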
I'm open to improving the current sort kernel as well 👍
Thanks for the review, will submit another PR for this.
This PR introduces support for the Fused MoE kernel for both unquantized and quantized models.
Qwen3 MoE GGUF models are now fully supported by leveraging the dedicated MoE kernel developed for the Candle ecosystem.
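For context, here is a minimal, illustrative sketch of the per-expert loop that a fused MoE kernel collapses into a single launch. The function name and shapes are assumptions for illustration only, and the routing (softmax) weights are omitted for brevity; this is not the PR's implementation.

```rust
// Illustrative only: a naive MoE forward that dispatches each token to its
// top-k experts with separate matmuls. A fused MoE kernel performs this
// routing + expert GEMM in one CUDA launch instead of a Rust-side loop.
use candle_core::{Result, Tensor};

fn naive_moe_forward(
    hidden: &Tensor,        // [n_tokens, hidden_dim]
    router_logits: &Tensor, // [n_tokens, n_experts]
    expert_w: &[Tensor],    // one [hidden_dim, hidden_dim] projection per expert (simplified)
    top_k: usize,
) -> Result<Tensor> {
    // Route each token: expert indices sorted by descending router logit.
    let order = router_logits.arg_sort_last_dim(false)?; // [n_tokens, n_experts]
    let topk_ids = order.narrow(1, 0, top_k)?.contiguous()?.to_vec2::<u32>()?;
    let mut out = hidden.zeros_like()?;
    for e in 0..expert_w.len() {
        // Gather the tokens assigned to expert `e`.
        let rows: Vec<u32> = topk_ids
            .iter()
            .enumerate()
            .filter(|(_, ids)| ids.contains(&(e as u32)))
            .map(|(t, _)| t as u32)
            .collect();
        if rows.is_empty() {
            continue;
        }
        let idx = Tensor::new(rows.as_slice(), hidden.device())?;
        let x = hidden.index_select(&idx, 0)?; // [n_e, hidden_dim]
        let y = x.matmul(&expert_w[e])?;       // per-expert GEMM
        // Scatter the expert output back (routing weights omitted here).
        out = out.index_add(&idx, &y, 0)?;
    }
    Ok(out)
}
```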
🔧 Usage Examples
Local GGUF File
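A hedged sketch of the local-file invocation, assuming the example accepts a GGUF path via a `--model` flag as other candle quantized examples do (the flag name and path are placeholders):

```bash
# Assumption: --model points the example at a local GGUF file; adjust to the
# example's actual CLI if it differs.
cargo run --features cuda --example quantized-qwen3-moe --release -- \
  --model /path/to/qwen3-moe-q4_k_m.gguf \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```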
Load from Hugging Face
```bash
cargo run --features cuda --example quantized-qwen3-moe --release -- --which 32b_q4k --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```

Available presets via the `--which` argument: `16b_q2k`, `16b_q4k`, `16b_q6k`, `16b_q80`, `32b_q2k`, `32b_q4k`, `32b_q6k`, `32b_q80`

Unquantized Model (Fused MoE Kernel)
Run the unquantized Qwen3-32B-A3B model using the fused MoE kernel (⚠ requires ~80GB GPU memory), either from local weights or by loading them remotely:
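As a rough sketch of both invocations (the example name `qwen3-moe` and its flags are hypothetical placeholders, not the PR's confirmed CLI):

```bash
# Hypothetical example name and flags, shown only to illustrate the two modes.
# From local weights:
cargo run --features cuda --example qwen3-moe --release -- \
  --weight-path /path/to/qwen3-moe-weights \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"

# Or loading the weights remotely from the Hugging Face Hub:
cargo run --features cuda --example qwen3-moe --release -- \
  --model-id <hf-repo-id> \
  --prompt "A train is travelling at 120mph, how far does it travel in 3 minutes 30 seconds?"
```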
📝 Testing Status
Full inference on the unquantized Qwen3-32B-A3B model has not been completed here due to GPU memory limitations (only the first 20 layers were tested).
However, the added code path has already been verified in other projects (under a multi-rank configuration), including candle-vllm and vllm.rs, where it functions correctly.
To run full inference on unquantized 32B+ models, a multi-rank / multi-GPU example will likely be required. 🔜