[Bugfix] Fix cuda graph sizes when running with speculative decoding #30330
Conversation
Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com>
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Code Review
The pull request addresses a bug where CUDA graph sizes were not correctly captured when speculative decoding was enabled, particularly when max_seq_len was small and num_speculative_tokens was greater than 1. The fix correctly incorporates the number of speculative tokens into the calculation of max_cudagraph_capture_size, ensuring that the system captures appropriate graph sizes for speculative decoding scenarios. This improves the correctness and efficiency of CUDA graph utilization under these conditions.
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai>
Hi @PatrykSaffer, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai>
…llm-project#30330) Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com> Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai> Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com>
…llm-project#30330) Signed-off-by: Patryk Saffer <patryk.saffer99@gmail.com> Signed-off-by: PatrykSaffer <patryk.saffer@mistral.ai> Co-authored-by: Patryk Saffer <patryk.saffer99@gmail.com> Signed-off-by: Nathan Price <nathan@abridge.com>
Purpose
When specifying max_seq_len and running with num_speculative_tokens > 1, not all CUDA graph sizes are captured. E.g. with max_seq_len == 1 and num_speculative_tokens == 3, only the graph for batch_size == 2 is captured.
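The intended behavior can be sketched as follows. This is a minimal illustration, not vLLM's actual code: the function name and parameters here are assumptions chosen for clarity. The idea is that with speculative decoding, each sequence can present 1 + num_speculative_tokens tokens to the model per step, so the largest CUDA graph must cover that many tokens rather than just the batch size.

```python
def max_cudagraph_capture_size(max_num_seqs: int,
                               num_speculative_tokens: int) -> int:
    """Hypothetical sketch of the fix, not vLLM's real implementation.

    With speculative decoding, each sequence contributes one target
    token plus num_speculative_tokens draft tokens per decode step,
    so the largest graph must cover that total token count.
    """
    tokens_per_seq = 1 + num_speculative_tokens
    return max_num_seqs * tokens_per_seq


# For the example from this PR: a single sequence with 3 speculative
# tokens needs a graph covering 4 tokens, not just 2.
print(max_cudagraph_capture_size(1, 3))
```

Without accounting for speculative tokens, the capture size is derived from the batch size alone, which explains why only the batch_size == 2 graph was captured in the example above.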
Test Plan
Manual test
Test Result
Tested manually