
Conversation

@heheda12345 heheda12345 commented Dec 9, 2025

Purpose

The draft tokens produced by the model runner are currently not passed back to the scheduler. This PR fixes that.
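
For context, a minimal sketch of the data flow this PR restores. The names (`update_draft_token_ids`, `draft_token_ids`, `model_output`) are illustrative assumptions and may not match the exact vLLM symbols:

```python
# Illustrative sketch only (assumed names): each engine step, the draft
# tokens proposed by the model runner must be handed back to the scheduler,
# otherwise the next step schedules zero draft tokens and speculative
# decoding silently degenerates to ordinary decoding.
def engine_step(scheduler, model_executor):
    scheduler_output = scheduler.schedule()
    model_output = model_executor.execute_model(scheduler_output)
    # The missing hop: propagate the proposed draft tokens to the scheduler.
    scheduler.update_draft_token_ids(model_output.draft_token_ids)
    return scheduler.update_from_output(scheduler_output, model_output)
```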

Test Plan

VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 examples/offline_inference/spec_decode.py --test

Test Result

Before this PR:

--------------------------------------------------
total_num_output_tokens: 247913
num_drafts: 0
num_draft_tokens: 0
num_accepted_tokens: 0
mean acceptance length: 1.00
--------------------------------------------------
acceptance at token 0: 0.00
acceptance at token 1: 0.00

After this PR:

--------------------------------------------------
total_num_output_tokens: 247893
num_drafts: 122887
num_draft_tokens: 245774
num_accepted_tokens: 124787
mean acceptance length: 2.02
--------------------------------------------------
acceptance at token 0: 0.69
acceptance at token 1: 0.33
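
As a sanity check, the "after" numbers are consistent with computing the mean acceptance length as one target token plus the accepted draft tokens per draft (illustrative arithmetic, not the exact script code):

```python
# Rough consistency check on the reported stats (assumed formula).
num_drafts = 122887
num_draft_tokens = 245774          # 2 proposed draft tokens per draft
num_accepted_tokens = 124787

mean_acceptance_length = 1 + num_accepted_tokens / num_drafts
print(f"{mean_acceptance_length:.2f}")  # 2.02
```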

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@mergify mergify bot added the v1 label Dec 9, 2025
@heheda12345 heheda12345 requested a review from njhill December 9, 2025 08:27

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix speculative decoding when not using multiprocessing by adding a call to post_step. While the change correctly identifies the missing call, the implementation hardcodes model_executed=True. This could lead to issues if step_fn is called without the model actually executing. I've provided a suggestion to use the model_executed flag returned by step_fn to ensure correctness and maintain consistency with the multiprocessing implementation.
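
Roughly, the suggestion is to thread the flag through instead of hardcoding it (a sketch based on the comment above, assuming step_fn returns the model_executed flag as in the multiprocessing path; not the exact diff):

```python
# Sketch of the suggested change: reuse the flag returned by step_fn
# rather than passing model_executed=True unconditionally.
outputs, model_executed = step_fn()
# Only report a model execution to post_step when the model actually ran,
# keeping this in-process path consistent with the multiprocessing loop.
self.post_step(model_executed=model_executed)
```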

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

@njhill njhill left a comment

Thanks @heheda12345!

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 12, 2025
@njhill njhill enabled auto-merge (squash) December 12, 2025 16:07

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1
