Skip to content

Commit c71df5b

Browse files
author
Paolo Abeni
committed
netpoll: prevent hanging NAPI when netcons gets enabled
JIRA: https://issues.redhat.com/browse/RHEL-115597 Upstream commit: commit 2da4def Author: Jakub Kicinski <kuba@kernel.org> Date: Fri Jul 25 18:08:46 2025 -0700 netpoll: prevent hanging NAPI when netcons gets enabled Paolo spotted hangs in NIPA running driver tests against virtio. The tests hang in virtnet_close() -> virtnet_napi_tx_disable(). The problem is only reproducible if running multiple of our tests in sequence (I used TEST_PROGS="xdp.py ping.py netcons_basic.sh \ netpoll_basic.py stats.py"). Initial suspicion was that this is a simple case of double-disable of NAPI, but instrumenting the code reveals: Deadlocked on NAPI ffff888007cd82c0 (virtnet_poll_tx): state: 0x37, disabled: false, owner: 0, listed: false, weight: 64 The NAPI was not in fact disabled, owner is 0 (rather than -1), so the NAPI "thinks" it's scheduled for CPU 0 but it's not listed (!list_empty(&n->poll_list) => false). It seems odd that normal NAPI processing would wedge itself like this. Better suspicion is that netpoll gets enabled while NAPI is polling, and also grabs the NAPI instance. This confuses napi_complete_done(): [netpoll] [normal NAPI] napi_poll() have = netpoll_poll_lock() rcu_access_pointer(dev->npinfo) return NULL # no netpoll __napi_poll() ->poll(->weight) poll_napi() cmpxchg(->poll_owner, -1, cpu) poll_one_napi() set_bit(NAPI_STATE_NPSVC, ->state) napi_complete_done() if (NAPIF_STATE_NPSVC) return false # exit without clearing SCHED This feels very unlikely, but perhaps virtio has some interactions with the hypervisor in the NAPI ->poll that makes the race window larger? Best I could to to prove the theory was to add and trigger this warning in napi_poll (just before netpoll_poll_unlock()): WARN_ONCE(!have && rcu_access_pointer(n->dev->npinfo) && napi_is_scheduled(n) && list_empty(&n->poll_list), "NAPI race with netpoll %px", n); If this warning hits the next virtio_close() will hang. This patch survived 30 test iterations without a hang (without it the longest clean run was around 10). Credit for triggering this goes to Breno's recent netconsole tests. Fixes: 1da177e ("Linux-2.6.12-rc2") Reported-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/c5a93ed1-9abe-4880-a3bb-8d1678018b1d@redhat.com Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://patch.msgid.link/20250726010846.1105875-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
1 parent 233f002 commit c71df5b

File tree

1 file changed

+7
-0
lines changed

1 file changed

+7
-0
lines changed

net/core/netpoll.c

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -784,6 +784,13 @@ int netpoll_setup(struct netpoll *np)
784784
if (err)
785785
goto put;
786786
rtnl_unlock();
787+
788+
/* Make sure all NAPI polls which started before dev->npinfo
789+
* was visible have exited before we start calling NAPI poll.
790+
* NAPI skips locking if dev->npinfo is NULL.
791+
*/
792+
synchronize_rcu();
793+
787794
return 0;
788795

789796
put:

0 commit comments

Comments
 (0)