
Conversation

@apullo777 (Contributor)

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

To address #20852

This PR tries to deflake TestCacheLaggingWatcher/pipeline_overflow by adding a short wait before asserting on the collected events.

@serathius

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: apullo777
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @apullo777. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


generateEvents(t, client, prefix, tt.eventCount)

if tt.name == "pipeline overflow" {
@serathius (Member) commented Nov 15, 2025

Please don't depend on hardcoded test names; instead, add a delay argument to the test scenario.

@apullo777 (Contributor, Author)

Thanks. I will add a new field, postGenerateDelay time.Duration, to the test case struct, set it only for this scenario, and do the following:

    generateEvents(t, client, prefix, tt.eventCount)

    if tt.postGenerateDelay > 0 {
        time.Sleep(tt.postGenerateDelay)
    }
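
For context, a minimal sketch of how the test table might carry the new field; the struct shape and the example values here are assumptions, and only postGenerateDelay and its use come from this thread:

    tests := []struct {
        name              string
        eventCount        int
        postGenerateDelay time.Duration // extra wait after generating events; zero for all other cases
    }{
        // hypothetical entry for the flaky scenario; the real count and delay may differ
        {name: "pipeline overflow", eventCount: 1000, postGenerateDelay: 50 * time.Millisecond},
    }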

Signed-off-by: Peter Chang <peter.yaochen.chang@gmail.com>
@serathius (Member)

This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.

@apullo777 (Contributor, Author)

> This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.

I just realized I cannot reproduce the specific CI flakes locally.

I noticed that my local stress failures (which also affect events_fill_window and event_fill_pipeline) do not show the "getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)" log line. I suspect my local failures are primarily race conditions between the test reader and the resync loop.

In contrast, the CI failure seems to point to something deeper, perhaps related to our watch retry logic reusing a stale start revision (broadcasting an event at rev 32 while demux.maxRev=42)?
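
To make that concrete, the check the CI log above trips presumably looks something like the sketch below; the function name and signature are assumptions, and only the error wording comes from the quoted log:

    // Sketch only: name and signature are assumptions; the error text mirrors
    // the "stale event batch" line from the CI log.
    func validateBatch(batchRev, latestRev int64) error {
        if batchRev < latestRev {
            return fmt.Errorf("cache: stale event batch (rev %d < latest %d)", batchRev, latestRev)
        }
        return nil
    }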

Are you able to reproduce that specific CI flake locally? And would you suggest I focus on finding a way to stabilize the test races first, or prioritize finding a reproduction path for the CI flake?

@rockswe (Contributor) commented Nov 22, 2025

I've dug deeper, and it looks like the root cause is in the cache internals: the internal store watcher was using the same buffer size as client watchers, causing it to lag under burst traffic. When it lagged, the resync mechanism tried to replay stale events (e.g., rev 32) to a store that had already progressed (rev 42), which triggered the validation error.

I implemented a fix that:

  1. Gives the internal store watcher a dedicated buffer (auto-sized to 50x larger)
  2. Prevents it from ever being resynced (proper restart path instead)
  3. Scales to production workloads without cascading failures

The fix passes 100% in stress testing (20+ consecutive runs). I'll open a new PR with this root-cause fix and link it here. Thanks for investigating this! Your work helped identify the pattern.
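
For illustration only, a minimal sketch of the buffer-sizing idea in point 1 above; all names and the base size are assumptions, and only the 50x scale factor comes from this comment:

    // Sketch only: names and the base size are assumptions.
    const (
        clientWatcherBufferSize = 128 // hypothetical per-client watcher buffer
        storeWatcherScale       = 50  // internal store watcher buffer auto-sized 50x larger
        storeWatcherBufferSize  = clientWatcherBufferSize * storeWatcherScale
    )

    // newStoreWatcherCh allocates the dedicated, larger buffer so the internal
    // store watcher does not lag under bursts and never needs to be resynced.
    func newStoreWatcherCh[T any]() chan T {
        return make(chan T, storeWatcherBufferSize)
    }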
