Skip to content

Conversation

@0xtrooper
Copy link

What:
Use a 5-second context timeout when dialing primary and fallback EC URLs to avoid blocking indefinitely.

Why:
Prevents startup hangs if an endpoint is down or unresponsive.

How:
Introduce dialWithTimeout(url string) helper that utilizes ethclient.DialContext(..) and apply to both primary and fallback dials in NewExecutionClientManager.

0xtrooper and others added 2 commits May 21, 2025 12:58
When establishing connections this avoids having the function stuck trying to connect to the clients
@jshufro
Copy link
Contributor

jshufro commented May 27, 2025

So, I talked to fornax a bit about this offline, but this is sort of a bandaid that won't actually fix the larger problem- that we abstracted away the context fields in all of these requests within the client.

When the rocketpool binary starts, we should create a cancelable parent context for all actions, and make children using WithTimeout. This would be an enormous refactor, but it would mean that context-aware timeouts would be possible.

5 seconds seems too short for a global dial timeout. ECs that are in the middle of syncing, for example, are still dialed to get sync status, but may take more than 5 seconds to reply.

I think it's a little too dangerous to apply a unilateral 5 second timeout here.

@0xtrooper
Copy link
Author

I agree that a global context would make a lot of sense. Would it be reasonable to have a (longer) timeout for the fallback client only? Feels a bit strange to have the UI not respond because the fallback EC has some issues.

@jshufro
Copy link
Contributor

jshufro commented May 28, 2025

I agree that a global context would make a lot of sense. Would it be reasonable to have a (longer) timeout for the fallback client only? Feels a bit strange to have the UI not respond because the fallback EC has some issues.

Sure, that would make sense. I think at some point I also want to add a feature that treats the fallback ec as the primary for queries in rocketpool_node and rocketpool_api so that things like collecting metrics don't use the primary ec if a fallback is available, but adding a timeout shouldn't really hurt that use-case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants