
Conversation

@cfallin
Member

@cfallin cfallin commented Dec 19, 2025

The debug support in Wasmtime so far is structured around async callbacks that occur at certain kinds of events, like breakpoints. This is a suitable foundation, but it makes a top-level debugger awkward to implement: such a debugger likely has an event loop dealing with user commands (via a UI or a protocol connection) and expects to perform actions such as "run until next breakpoint".

This PR introduces a new crate that wraps a `Store` in a `Debugger`. The wrapper holds an inner async body that can perform whatever actions it likes on the `Store` passed back into it; this inner body is spawned as an async task. The debugger wrapper registers its own `DebugHandler` callback that communicates with the outside world via bidirectional command/response queues.

On the "outside", the `Debugger` presents an interface suitable for inserting into a debug protocol server or UI: an async method that runs until the next event and returns that event, and a method that permits querying or modifying the store whenever the `run` method is not executing. The latter operates by sending a closure over the queue, because the `Store` must continue to be owned by the async task that is (still) running and suspended in async callbacks.
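
To make the shape concrete, here is a rough sketch of what such an interface could look like. The names and signatures (`DebugEvent`, `run`, `with_store`) are illustrative assumptions, not the new crate's actual API:

```rust
use anyhow::Result;
use wasmtime::Store;

/// Hypothetical event yielded when the debugged guest stops.
pub enum DebugEvent {
    Breakpoint,
    Finished,
}

pub struct Debugger<T> {
    // Stand-in for the bidirectional command/response queues connecting this
    // handle to the spawned task that owns the `Store<T>`.
    _queues: std::marker::PhantomData<T>,
}

impl<T: Send + 'static> Debugger<T> {
    /// Resume execution until the next event (breakpoint, completion, ...).
    pub async fn run(&mut self) -> Result<DebugEvent> {
        unimplemented!("send a resume command over the queue; await the next event")
    }

    /// Query or modify the store while `run` is not executing. The closure is
    /// shipped over the queue because the `Store` stays owned by the task that
    /// is suspended inside the async debug callback.
    pub async fn with_store<R, F>(&mut self, _f: F) -> Result<R>
    where
        F: FnOnce(&mut Store<T>) -> R + Send + 'static,
        R: Send + 'static,
    {
        unimplemented!("box the closure, send it over the queue, await the result")
    }
}
```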

Right now, this is exercised only via a few unit tests, but the intent is to next build up the "top half" of the debugger using this abstraction, e.g. by running a gdbstub protocol server (likely as a Wasm component in a "debug-main WIT world" -- RFC needed for this).

Also, when we eventually move debugging over to native use of `run_concurrent`, this paradigm should remain mostly unchanged at this level of API: there can still be an object that has an async method that runs and yields the next event, and there can still be a method that takes a closure that can operate (within its scope only) on the `Store`.

A few warts that I could use feedback on:

  • Cancelation safety is weird. Fibers panic when dropped before execution of their body completes, and this seems to mean that we can't allow a `Debugger` to be dropped early (or, at least, we can't let the `tokio::test` unit test that owns the runtime running the async task finish before the debugged body completes!). If there is a better way to handle cancelation safety here, I'm all ears.

  • It's not clear to me if the boxed-closure-and-`Any` approach to providing access to the `Store` is the best we can do, but I suspect it is (a self-contained sketch of the idea follows below).
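
For the second point, a minimal, self-contained sketch of the boxed-closure-and-`Any` pattern using only the standard library (a thread owning a `String` stands in for the async task owning the `Store`; this illustrates the idea, not the PR's code):

```rust
use std::any::Any;
use std::sync::mpsc;
use std::thread;

// A type-erased operation on the owned state, with a type-erased result.
type StoreOp = Box<dyn FnOnce(&mut String) -> Box<dyn Any + Send> + Send>;

fn main() {
    let (op_tx, op_rx) = mpsc::channel::<StoreOp>();
    let (res_tx, res_rx) = mpsc::channel::<Box<dyn Any + Send>>();

    // "Inside" half: the task that owns the state runs closures sent to it.
    let owner = thread::spawn(move || {
        let mut store = String::from("wasm state");
        while let Ok(op) = op_rx.recv() {
            res_tx.send(op(&mut store)).unwrap();
        }
    });

    // "Outside" half: ship a closure over the queue, downcast the reply.
    op_tx
        .send(Box::new(|store: &mut String| {
            store.push_str(", inspected");
            let result: Box<dyn Any + Send> = Box::new(store.len());
            result
        }))
        .unwrap();
    let len = *res_rx.recv().unwrap().downcast::<usize>().unwrap();
    println!("store length after edit: {len}");

    drop(op_tx); // close the command queue so the owner exits
    owner.join().unwrap();
}
```

The result comes back as a boxed `Any` and is downcast on the outside, which is essentially the part of the design the bullet above is asking about.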

@cfallin cfallin requested a review from alexcrichton December 19, 2025 02:56
@cfallin cfallin requested review from a team as code owners December 19, 2025 02:56
Member

@fitzgen fitzgen left a comment


Have to context switch, but want to get my comments thus far posted.

Member

@alexcrichton alexcrichton left a comment


this seems to mean that we can't allow a Debugger to drop early

This is definitely a bug in Wasmtime that we need to fix. Can you add a failing test or gist one that I can help poke around?

@cfallin
Member Author

cfallin commented Dec 19, 2025

This is definitely a bug in Wasmtime that we need to fix. Can you add a failing test or gist one that I can help poke around?

Sure thing -- with this commit and

cargo test -p wasmtime-internal-debugger -- basic_debugger

I see

thread 'test::basic_debugger' (1129985) panicked at /home/cfallin/work/wasmtime/crates/wasmtime/src/runtime/fiber.rs:199:70:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

@cfallin
Member Author

cfallin commented Dec 19, 2025

(It's entirely possible that I'm holding some part of the async machinery wrong, either in this PR or in earlier work to add the async debugger hook -- happy to have your opinions on that if so!)

@alexcrichton
Member

alexcrichton commented Dec 22, 2025

That'll get fixed by this:

diff --git a/crates/wasmtime/src/runtime/fiber.rs b/crates/wasmtime/src/runtime/fiber.rs
index 68dce6884f..1ab5120ef3 100644
--- a/crates/wasmtime/src/runtime/fiber.rs
+++ b/crates/wasmtime/src/runtime/fiber.rs
@@ -358,7 +358,7 @@ impl StoreOpaque {
     }

     /// Returns whether `block_on` will succeed or panic.
-    #[cfg(feature = "call-hook")]
+    #[cfg(any(feature = "call-hook", feature = "debug"))]
     pub(crate) fn can_block(&mut self) -> bool {
         self.fiber_async_state_mut().current_future_cx.is_some()
     }
diff --git a/crates/wasmtime/src/runtime/store.rs b/crates/wasmtime/src/runtime/store.rs
index 0ce69fc6c3..ac6929d376 100644
--- a/crates/wasmtime/src/runtime/store.rs
+++ b/crates/wasmtime/src/runtime/store.rs
@@ -2839,6 +2839,9 @@ unsafe impl<T> VMStore for StoreInner<T> {
     #[cfg(feature = "debug")]
     fn block_on_debug_handler(&mut self, event: crate::DebugEvent<'_>) -> anyhow::Result<()> {
         if let Some(handler) = self.debug_handler.take() {
+            if !self.can_block() {
+                bail!("couldn't grab async_cx for debug handler")
+            }
             log::trace!("about to raise debug event {event:?}");
             StoreContextMut(self).with_blocking(|store, cx| {
                 cx.block_on(Pin::from(handler.handle(store, event)).as_mut())

tl;dr: when a store is dropped, all fibers are resumed with a trap that says "please die immediately", but in some circumstances internal code attempts something async while that error is propagating out of everything. The async attempt fails because the blocking context (rightfully) wasn't configured.

This happened before with call hooks, for example, where we would try to invoke the call hook while the fiber was exiting to say "hey, a trap happened". The fix here is to just skip debug events/call hooks/etc. in this situation, since if the store is being dropped that's not interesting anyway.

(if you wouldn't mind this'd be a good thing to add a test for too)

@cfallin
Member Author

cfallin commented Dec 22, 2025

Awesome, thanks @alexcrichton for that fix! Updated and added a test with an early drop of debugger/store.

Member

@alexcrichton alexcrichton left a comment


👍 from me (I defer to you/Nick to possibly await his approval too)

@cfallin
Member Author

cfallin commented Dec 22, 2025

Thanks! I'll go ahead and merge and we can always tweak anything we need as we build on top of this.

@cfallin cfallin added this pull request to the merge queue Dec 22, 2025
@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Dec 22, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 22, 2025
cfallin added a commit to cfallin/wasmtime that referenced this pull request Dec 23, 2025
These Pulley instructions are an optimization: they incorporate moves
from other registers to x0/x1/x2/x3 as part of the call itself.

This works fine for the default calling convention, but the logic that
absorbs these moves and removes them from the list given to the rest of
the VCode/ABI machinery embeds a tacit assumption: the argument
registers are volatile (caller-saved). In particular, this is because
the call{1,2,3,4} instructions are *not* marked as clobbering
x0/x1/x2/x3 explicitly.

To fix this issue, this PR simply omits the magical folding that happens
in the emission code. An alternative would be to try to add the clobbers
to the special instructions, but actually we don't want these to be just
"clobbers": we want the usual regalloc constraints to take effect and
the values to be represented as continuing valid/live in the registers
after return, and the proper way to do that is to leave the `args` on
the call and let the regalloc operand pass add the fixed-reg operands.

Discovered while trying to land bytecodealliance#12183.
@cfallin
Member Author

cfallin commented Dec 23, 2025

CI failure on i686 (i.e., pulley32) is due to a problem with PreserveAll interacting poorly with Pulley's call1/2/3/4 optimization; #12213 fixes it (tested locally). I'll merge that into this branch once that PR lands.

github-merge-queue bot pushed a commit that referenced this pull request Dec 23, 2025
…BI. (#12213)

@cfallin cfallin enabled auto-merge December 23, 2025 19:23
@cfallin cfallin added this pull request to the merge queue Dec 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 23, 2025
@alexcrichton
Member

The CI error there is:

Running: `"curl" "--user-agent" "bytecodealliance/wasmtime auto-publish script" "https://crates.io/api/v1/crates/wasmtime-internal-debugger/owners"`

thread 'main' (2599) panicked at scripts/publish.rs:623:17:

crate wasmtime-internal-debugger is not owned by wasmtime-publish, please run:

    cargo owner -a wasmtime-publish wasmtime-internal-debugger

note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

notably we verify that all crates are owned by wasmtime-publish and already published to crates.io -- this ensures the auto-publish later succeeds. Can you publish a stub crate and invite the wasmtime-publish owner?

@cfallin
Member Author

cfallin commented Dec 23, 2025

Yep, I've just published the crate and invited wasmtime-publish -- let me know when ready to re-attempt landing.

@cfallin cfallin added this pull request to the merge queue Dec 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 23, 2025
@cfallin
Member Author

cfallin commented Dec 23, 2025

I'm not sure I understand the failure now -- I added wasmtime-publish and it shows as an owner; I removed myself as an owner; it's still failing. @alexcrichton any ideas?

@alexcrichton
Member

Oh, that's wasmtime-publish-the-GitHub-team, not wasmtime-publish-the-GitHub-user. If you run the command it printed out, it should be resolved.

@cfallin
Member Author

cfallin commented Dec 23, 2025

Hmm. Unfortunately I removed myself from the crate after wasmtime-publish accepted the invite. So I'm not able to make any more permissions changes.

Do you have credentials for that user so you can log in and invite the team? Or have we lost the crate name forever?

(FWIW, I want to reiterate my suggestion from earlier that we have a runbook for all of this now that we have such a complex publishing and permissions scheme, because this was all very ambiguous and undocumented to me)

@alexcrichton
Member

Yeah, the BA has a 1Password with credentials for wasmtime-publish (that's how I accepted the invite in the first place). I added the team. Agreed this can be documented better (like pretty much everything else infrastructure-related), and it's something I hope to handle in January with the move to trusted publishing.

@alexcrichton alexcrichton added this pull request to the merge queue Dec 24, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 24, 2025
- Do not link to non-existent README in new crate.
- Remove a few other attributes that our internal crates don't have (now
  the set of attributes at the top level is the same as for the new
  error crate).
@cfallin cfallin enabled auto-merge December 24, 2025 18:47
@cfallin
Member Author

cfallin commented Dec 24, 2025

Thanks!

@cfallin cfallin added this pull request to the merge queue Dec 24, 2025
Merged via the queue into bytecodealliance:main with commit 405ab55 Dec 24, 2025
45 checks passed
@cfallin cfallin deleted the debugger-crate branch December 24, 2025 19:22