deflake TestCacheLaggingWatcher by waiting for resync #20931
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: apullo777. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
Hi @apullo777. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with …. Once the patch is verified, the new status will be reflected by the …. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
tests/integration/cache_test.go
Outdated
generateEvents(t, client, prefix, tt.eventCount)
if tt.name == "pipeline overflow" {
Please don't depend on hardcoded test names; instead, add a delay argument to the test scenario.
Thanks. I will add a new field postGenerateDelay time.Duration to the test case struct, set it only for this test scenario, and do the following:
generateEvents(t, client, prefix, tt.eventCount)
// Optional settle time before asserting on the collected events.
if tt.postGenerateDelay > 0 {
	time.Sleep(tt.postGenerateDelay)
}
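For context, here is a minimal sketch of what the table-driven scenario could look like with the proposed field. Everything except the postGenerateDelay field itself (the struct name, the other fields, and the values) is hypothetical; the real table in tests/integration/cache_test.go has more fields.

```go
package cache_test

import (
	"testing"
	"time"
)

// laggingWatcherCase is a hypothetical stand-in for the real test-case struct;
// only postGenerateDelay corresponds to the field proposed above.
type laggingWatcherCase struct {
	name              string
	eventCount        int
	postGenerateDelay time.Duration // extra settle time after generating events
}

func TestScenarioShape(t *testing.T) {
	cases := []laggingWatcherCase{
		// Only the overflow scenario sets a delay; other cases leave it zero.
		{name: "pipeline overflow", eventCount: 1000, postGenerateDelay: 100 * time.Millisecond},
		{name: "baseline", eventCount: 10},
	}
	_ = cases // the real test iterates the cases and runs each scenario
}
```

This keeps the delay a property of the scenario rather than a check against a hardcoded test name, which is what the review comment asks for.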
Signed-off-by: Peter Chang <peter.yaochen.chang@gmail.com>
366acdd to 91435ce
This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.
I just realized I cannot reproduce the specific CI flakes locally. I noticed my local stress failures (which also affect …) … In contrast, the CI failure seems to point to something deeper, perhaps related to our watch retry logic reusing a stale start revision (broadcasting event at rev 32 with …).

Are you able to reproduce that specific CI flake locally? And would you suggest I focus on finding a way to stabilize the test races first, or prioritize finding a reproduction path for the CI flake?
I've dug deeper and it looks like the root cause is in the cache internals: the internal store watcher was using the same buffer size as client watchers, causing it to lag during burst traffic. When it lagged, the resync mechanism tried to replay stale events (e.g., rev 32) to a store that had already progressed (rev 42), causing the validation error. I implemented a fix that: …

The fix passes 100% in stress testing (20+ consecutive runs). I'll open a new PR with this root-cause fix and link it here. Thanks for investigating this! Your work helped identify the pattern.
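To make the described failure mode concrete, below is a minimal, hypothetical Go sketch of how an internal watcher with a client-sized buffer can lag under burst traffic and then replay events older than the store's current revision. None of the names (store, event, apply) come from etcd's cache package; the sketch only illustrates the revision check that produces the validation error.

```go
package main

import "fmt"

type event struct{ rev int64 }

type store struct {
	rev int64 // highest revision already applied to the store
}

// apply rejects events at or below the store's current revision, mirroring the
// validation error described above (replaying rev 32 into a store at rev 42).
func (s *store) apply(e event) error {
	if e.rev <= s.rev {
		return fmt.Errorf("stale event: rev %d <= store rev %d", e.rev, s.rev)
	}
	s.rev = e.rev
	return nil
}

func main() {
	// An internal watcher sized like a client watcher: a small buffer that
	// overflows during burst traffic, so the watcher lags behind the store.
	buffer := make(chan event, 2)
	s := &store{rev: 42}

	// A resync from a stale start revision tries to push old events.
	for _, rev := range []int64{32, 33, 34} {
		select {
		case buffer <- event{rev: rev}:
		default:
			// Buffer full: the event is dropped and the watcher lags further.
		}
	}
	close(buffer)

	for e := range buffer {
		if err := s.apply(e); err != nil {
			fmt.Println(err) // e.g. stale event: rev 32 <= store rev 42
		}
	}
}
```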
Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.
To address #20852
This PR tries to deflake TestCacheLaggingWatcher/pipeline_overflow by adding a short wait before asserting on the collected events.

@serathius