
Conversation

@apullo777 (Contributor)

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

To address #20852

This PR tries to deflake TestCacheLaggingWatcher/pipeline_overflow by adding a short wait before asserting on the collected events.

@serathius

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: apullo777
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @apullo777. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


generateEvents(t, client, prefix, tt.eventCount)

if tt.name == "pipeline overflow" {
@serathius (Member) commented Nov 15, 2025

Please don't depend on hardcoded test names; instead, add a delay argument to the test scenario.

@apullo777 (Contributor, Author)

Thanks. I will add a new field, postGenerateDelay time.Duration, to the test case struct, set it only for this scenario, and do the following:

    generateEvents(t, client, prefix, tt.eventCount)

    if tt.postGenerateDelay > 0 {
        time.Sleep(tt.postGenerateDelay)
    }
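
For context, a minimal sketch of how the test table might carry the new field; the struct shape and the example values here are assumptions, and only postGenerateDelay and its use come from this thread:

    tests := []struct {
        name              string
        eventCount        int
        postGenerateDelay time.Duration // extra wait after generating events; zero for all other cases
    }{
        // hypothetical entry for the flaky scenario; the real count and delay may differ
        {name: "pipeline overflow", eventCount: 1000, postGenerateDelay: 50 * time.Millisecond},
    }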

Signed-off-by: Peter Chang <peter.yaochen.chang@gmail.com>
@serathius (Member)

This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.

@apullo777 (Contributor, Author)

> This barely lowers the flakiness on my machine from 9% to 7%. Adding random sleeps is not a solution.

I just realized I cannot reproduce the specific CI flakes locally.

I noticed that my local stress failures (which also affect events_fill_window and event_fill_pipeline) do not show the "getWatch failed, will retry after 50ms: cache: stale event batch (rev 32 < latest 42)" log line. I suspect my local failures are primarily race conditions between the test reader and the resync loop.

In contrast, the CI failure seems to point to something deeper, perhaps related to our watch retry logic reusing a stale start revision (broadcasting an event at rev 32 while demux.maxRev=42)?
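
To make that concrete, the check the CI log above trips presumably looks something like the sketch below; the function name and signature are assumptions, and only the error wording comes from the quoted log:

    // Sketch only: name and signature are assumptions; the error text mirrors
    // the "stale event batch" line from the CI log.
    func validateBatch(batchRev, latestRev int64) error {
        if batchRev < latestRev {
            return fmt.Errorf("cache: stale event batch (rev %d < latest %d)", batchRev, latestRev)
        }
        return nil
    }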

Are you able to reproduce that specific CI flake locally? And would you suggest I focus on finding a way to stabilize the test races first, or prioritize finding a reproduction path for the CI flake?

@rockswe (Contributor) commented Nov 22, 2025

I've dug deeper, and it looks like the root cause is in the cache internals: the internal store watcher was using the same buffer size as client watchers, causing it to lag under burst traffic. When it lagged, the resync mechanism tried to replay stale events (e.g., rev 32) to a store that had already progressed (rev 42), which triggered the validation error.

I implemented a fix that:

  1. Gives the internal store watcher a dedicated buffer (auto-sized to 50x larger)
  2. Prevents it from ever being resynced (proper restart path instead)
  3. Scales to production workloads without cascading failures

The fix passes 100% in stress testing (20+ consecutive runs). I'll open a new PR with this root-cause fix and link it here. Thanks for investigating this! Your work helped identify the pattern.
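
For illustration only, a minimal sketch of the buffer-sizing idea in point 1 above; all names and the base size are assumptions, and only the 50x scale factor comes from this comment:

    // Sketch only: names and the base size are assumptions.
    const (
        clientWatcherBufferSize = 128 // hypothetical per-client watcher buffer
        storeWatcherScale       = 50  // internal store watcher buffer auto-sized 50x larger
        storeWatcherBufferSize  = clientWatcherBufferSize * storeWatcherScale
    )

    // newStoreWatcherCh allocates the dedicated, larger buffer so the internal
    // store watcher does not lag under bursts and never needs to be resynced.
    func newStoreWatcherCh[T any]() chan T {
        return make(chan T, storeWatcherBufferSize)
    }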
