Conversation


@alpe alpe commented Dec 16, 2025

Overview

The queue state should be the same after a restart.

Note to reviewer:
This PR contains a breaking change to the queue store. I am not sure whether we need a migration.

@alpe alpe marked this pull request as ready for review December 16, 2025 14:18

github-actions bot commented Dec 16, 2025

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ⏩ skipped | ✅ passed | ✅ passed | Dec 19, 2025, 8:31 PM

@gemini-code-assist

Summary of Changes

Hello @alpe, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an issue where the BatchQueue could lose its deterministic state after a restart. By implementing a new mechanism for generating and managing sequential keys for persisted batches, the queue can now reliably restore its exact order and content from storage, ensuring consistent behavior and preventing potential data reordering or loss in crash recovery scenarios.

Highlights

  • Deterministic Persistence: Batches are now persisted using deterministically generated sequential keys instead of content hashes, ensuring the queue state is consistent across restarts.
  • Queued Item Struct: A new queuedItem struct has been introduced to store both the batch and its unique persistence key, allowing the queue to manage and retrieve items based on their ordered keys.
  • Sequence Number Management: The BatchQueue now tracks nextAddSeq and nextPrependSeq to generate unique, ordered keys for new batches. These sequence counters are correctly re-initialized upon loading from persistent storage to prevent key collisions.
  • Ordered Loading: The Load operation has been enhanced to query the datastore with OrderByKey, ensuring batches are loaded in their correct order. It also reconstructs the in-memory queue and sequence counters based on the persisted, ordered keys.


claude bot commented Dec 16, 2025

Claude finished @alpe's task in 4m 6s.


Code Review Complete

I've completed a comprehensive review of the deterministic queue implementation. This PR makes an important architectural improvement by switching from content-hash-based keys to sequence-number-based keys, ensuring deterministic queue restoration after restarts.


Summary

Overall Assessment: APPROVED with recommendations

The implementation is well-designed and addresses a critical issue with queue state persistence. The code is production-ready with good test coverage. My main recommendations focus on improving observability (structured logging), error handling resilience, and addressing a few edge cases.


1. Code Quality & Architecture ✅

Strengths:

  • Clean sequence number design: Using initialSeqNum = 0x8000000000000000 (middle of uint64 range) is elegant, allowing both append (increment) and prepend (decrement) operations without collision risk
  • Deterministic key generation: seqToKey() uses big-endian encoding so that lexicographic sort order matches numeric order, which is crucial for correct queue ordering (see the sketch after this list)
  • Migration strategy: Automatic detection and migration of legacy hash-based keys is well-implemented
  • Memory efficiency: The compaction logic in Next() (sequencers/single/queue.go:146-154) prevents memory leaks by releasing processed items

Concerns:

  1. Missing structured logging (echoes gemini-code-assist's feedback)

    • Lines 160, 189, 199, 228, 234, 239 use fmt.Printf for error logging
    • Impact: In production, these errors go to stdout with no severity levels, making monitoring difficult
    • Recommendation: Inject a structured logger (e.g., log/slog, zap) into BatchQueue for proper observability
    • Priority: HIGH - critical for production debugging
  2. Sequence counter overflow handling

    • While uint64 provides massive headroom (roughly 9.2 quintillion operations in each direction from the midpoint), there's no overflow detection
    • Scenario: If nextPrependSeq underflows or nextAddSeq overflows, keys would wrap around and potentially collide
    • Recommendation: Add overflow checks before incrementing/decrementing sequences:
      if bq.nextAddSeq == math.MaxUint64 {
          return fmt.Errorf("sequence number overflow")
      }
      if bq.nextPrependSeq == 0 {
          return fmt.Errorf("sequence number underflow")
      }
    • Priority: LOW - extremely unlikely in practice, but good defensive programming

2. Error Handling & Resilience ⚠️

Strengths:

  • Error wrapping is consistent (line 182: fmt.Errorf("error querying datastore: %w", err))
  • Load() gracefully handles malformed entries by logging and continuing (lines 188-191, 198-201)

Concerns:

  1. Silent error handling in critical paths (sequencers/single/queue.go:158-161)

    if err := bq.db.Delete(ctx, ds.NewKey(item.Key)); err != nil {
        // Log the error but continue
        fmt.Printf("Error deleting processed batch: %v\n", err)
    }
    • Issue: If deletion fails, the batch remains in persistent storage and will be reloaded on restart, causing duplicate processing
    • Impact: Queue state diverges from disk state, potentially leading to transaction reprocessing after crash
    • Recommendation: Either:
      • Return the error to caller (breaking change, more correct)
      • Track failed deletions and retry on next Load() (see the sketch after this list)
      • Add metrics/alerts for deletion failures
    • Priority: HIGH - affects data integrity
  2. Migration error handling (sequencers/single/queue.go:233-240)

    if err := bq.persistBatch(ctx, item.Batch, newKeyName); err != nil {
        fmt.Printf("Failed to migrate legacy item %s: %v\n", item.Key, err)
        continue  // Skips migration but doesn't track failure
    }
    • Issue: Failed migrations are silently dropped - the legacy item stays in DB but isn't loaded into queue
    • Impact: Transaction loss on migration failure
    • Recommendation:
      • Return error if any migration fails OR
      • Track partial migration state and retry on next startup
    • Priority: MEDIUM - migration is one-time but critical
  3. Context cancellation not checked in loops

    • The Load() loop (lines 187-224) doesn't check ctx.Done(), potentially blocking shutdown
    • Recommendation: Add select { case <-ctx.Done(): return ctx.Err(); default: } in the loop
    • Priority: LOW - Load() is typically fast
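
A minimal sketch of the retry-tracking option from concern 1 above; the failedDeletes field and both helpers are hypothetical, not part of this PR:

    package single

    import (
        "context"

        ds "github.com/ipfs/go-datastore"
    )

    // batchQueue sketch: remember keys whose Delete failed so Load() can retry.
    type batchQueue struct {
        db            ds.Datastore
        failedDeletes map[string]struct{}
    }

    func (bq *batchQueue) deleteProcessed(ctx context.Context, key string) {
        if err := bq.db.Delete(ctx, ds.NewKey(key)); err != nil {
            bq.failedDeletes[key] = struct{}{} // keep for retry instead of only logging
        }
    }

    // retryFailedDeletes would run at the start of Load(), before querying.
    func (bq *batchQueue) retryFailedDeletes(ctx context.Context) {
        for key := range bq.failedDeletes {
            if err := bq.db.Delete(ctx, ds.NewKey(key)); err == nil {
                delete(bq.failedDeletes, key)
            }
        }
    }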

3. Security 🔒

Overall: Good, no major concerns

  • ✅ No credential handling or sensitive data exposure
  • ✅ Input validation present (key length check on line 208: len(keyName) == 16)
  • ✅ No SQL injection risk (protobuf-based storage)
  • ✅ No obvious DoS vectors (maxQueueSize limits memory usage)

Minor observation:

  • The migration code (lines 228-244) processes all legacy items without batching, which could cause memory pressure if thousands of legacy items exist
  • Recommendation: Consider processing migrations in batches if you expect large legacy datasets
  • Priority: LOW - unlikely scenario

4. Performance & Resource Efficiency ⚡

Strengths:

  • Efficient compaction heuristic (line 146): Only compacts when head > len(queue)/2 && head > 100, avoiding frequent reallocations (reconstructed in the sketch after this list)
  • Big-endian encoding allows direct lexicographical ordering without conversion
  • Prepend optimization (lines 116-122): Reuses head space when available instead of always prepending to slice
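
An illustrative reconstruction of that compaction step, combining the heuristic above with the zeroing loop quoted under concern 2 below; field names are assumed:

    // Inside Next(), after advancing bq.head past a processed item:
    if bq.head > len(bq.queue)/2 && bq.head > 100 {
        remaining := copy(bq.queue, bq.queue[bq.head:]) // shift live items to the front
        for i := remaining; i < len(bq.queue); i++ {
            bq.queue[i] = queuedItem{} // drop references so the GC can reclaim batches
        }
        bq.queue = bq.queue[:remaining]
        bq.head = 0
    }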

Concerns:

  1. OrderByKey query cost (line 178)

    q := query.Query{
        Orders: []query.Order{query.OrderByKey{}},
    }
    • Depending on datastore backend, this could require full key sort on every Load()
    • Recommendation: Document expected datastore performance characteristics
    • Priority: LOW - likely acceptable for restart scenarios
  2. Memory zeroing overhead (lines 149-151)

    for i := remaining; i < len(bq.queue); i++ {
        bq.queue[i] = queuedItem{}
    }
    • This zeroing loop is good for GC but adds O(n) overhead
    • Alternative: bq.queue = bq.queue[:remaining:remaining] to prevent slice reuse (forces new allocation on append)
    • Priority: VERY LOW - current approach is safer

5. Testing & Reliability 🧪

Strengths:

  • ✅ Excellent test coverage with migration-specific tests (queue_migration_test.go)
  • TestBatchQueue_Load_SetsSequencesProperly (line 309) validates critical sequence counter restoration
  • ✅ Concurrency tests present (TestConcurrency, TestBatchQueue_QueueLimit_Concurrency)
  • ✅ Edge cases covered: empty queue, multiple prepends, persistence across restarts

Observations:

  1. Codecov shows 83.60% patch coverage with 10 lines missing

    • Missing lines likely in error paths (migration failures, delete errors)
    • Recommendation: Add tests that inject datastore failures to cover error paths
    • Priority: MEDIUM - important for production confidence
  2. Migration test validation (queue_migration_test.go:106-108)

    // We expect 0 keys in DB because we called Next() 3 times which deletes them from DB as well!
    require.Empty(keys, "Expected DB to be empty after processing all items")
    • This test validates successful migration, but doesn't test failure scenarios
    • Recommendation: Add test for partial migration failure (mock datastore errors)
    • Priority: MEDIUM
  3. Sequence collision test missing

    • No test validates behavior when loading items with sequence numbers near overflow boundaries
    • Recommendation: Add test with sequences near 0 and math.MaxUint64
    • Priority: LOW - edge case

6. Documentation & Developer Experience 📚

Strengths:

  • ✅ Clear comment on initialSeqNum explaining the design choice (lines 25-28)
  • ✅ Function comments explain behavior (e.g., Prepend bypassing maxQueueSize, lines 100-101)
  • ✅ Migration test comments explain expected behavior

Recommendations:

  1. Document migration process

    • Add a comment in Load() explaining the migration logic and what happens to legacy keys
    • Mention that migration is automatic and one-time
    • Priority: MEDIUM
  2. Document deletion failure implications

    • Add comment at line 158 explaining why deletion errors are logged but not returned
    • Document the potential for duplicate processing on restart if deletion fails
    • Priority: MEDIUM
  3. Breaking change documentation

    • The PR description mentions this is a breaking change but doesn't specify impact
    • Recommendation: Add a comment or doc explaining:
      • Old queues will auto-migrate on first Load()
      • No manual migration needed
      • Old hash-based keys are deleted after successful migration
    • Priority: HIGH - critical for operators

7. Concurrency Safety 🔐

Assessment: Well-protected

  • ✅ All public methods acquire bq.mu lock
  • Load() resets sequences under lock (lines 168-175)
  • ✅ Test coverage for concurrent operations (TestConcurrency)

No issues found in this area.


Critical Issues Summary

Priority | Issue | Location | Recommendation
HIGH | Structured logging missing | Lines 160, 189, 199, 228, 234, 239 | Inject logger into BatchQueue
HIGH | Deletion failures cause state drift | Lines 158-161 | Track failures or return error
HIGH | Breaking change not documented | PR description | Document migration behavior
MEDIUM | Migration failures silently drop data | Lines 233-240 | Return error or track failures
MEDIUM | Missing error path test coverage | N/A | Add datastore failure injection tests

Minor Suggestions

  1. Line 186: Consider extracting magic number 16 to a constant:

    const seqKeyLength = 16  // hex-encoded uint64
  2. Line 208: The validation logic is buried in a loop - consider extracting:

    func isValidSeqKey(key string) (uint64, bool) {
        if len(key) != seqKeyLength {
            return 0, false
        }
        seq, err := strconv.ParseUint(key, 16, 64)
        return seq, err == nil
    }
  3. Prepend comment (line 100): The comment says "NOTE: Prepend intentionally bypasses..." but doesn't explain why. Consider adding rationale (prevents transaction loss when batch size limits are hit).


Migration Consideration (Response to PR Author's Question)

"I am not sure if we need a migration."

Answer: You already have automatic migration

The code in Load() (lines 186-246) implements automatic migration:

  1. Detects legacy keys (non-16-char hex strings)
  2. Re-persists with new sequence-based keys
  3. Deletes old hash-based keys
  4. Updates sequence counters

This is the correct approach. Manual migration scripts would be more complex and error-prone.

However, consider adding:

  1. Migration observability: Emit a log/metric when migration happens with a count of migrated items (see the sketch after this list)
  2. Migration validation: After migration, verify all legacy keys are gone (add to test)
  3. Rollback documentation: If migration fails and operators need to rollback, document that they'd lose the deterministic queue benefit
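
A sketch of the observability hook from point 1; legacyItems, nextLegacyKey, and the injected logger are illustrative names layered on the persistBatch call shown earlier, not necessarily the PR's identifiers:

    // Hypothetical sketch of the migration tail in Load(): count outcomes and
    // emit a single structured record once migration completes.
    func (bq *batchQueue) migrateLegacy(ctx context.Context, legacyItems []queuedItem) {
        var migrated, failed int
        for _, item := range legacyItems {
            if err := bq.persistBatch(ctx, item.Batch, bq.nextLegacyKey()); err != nil {
                failed++
                continue
            }
            migrated++
        }
        bq.logger.Info("migrated legacy queue entries", "migrated", migrated, "failed", failed)
    }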

Reference: Issue #2795

Per @julienrbrt's comment, this relates to #2795 for tracking legacy code removal before v1. The legacy key detection code (lines 205-223) should be tracked for eventual removal once all deployments have migrated.


Final Recommendation

APPROVE with the following action items before merge:

  1. MUST: Add structured logging (addresses gemini-code-assist feedback + production needs)
  2. SHOULD: Improve deletion error handling in Next() to prevent state drift
  3. SHOULD: Document the breaking change and migration behavior for operators
  4. NICE-TO-HAVE: Add error injection tests to improve coverage to >90%

The core logic is sound and the tests demonstrate correctness. The issues identified are about production robustness and observability.


Great work on solving the deterministic queue restoration problem! The sequence number approach is much more robust than content hashing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the persistent queue to use deterministic sequence numbers for keys instead of content-based hashes. This is a significant improvement for robustness, ensuring the queue state can be reliably restored after a restart. The changes are well-implemented, including updating the Load function to correctly handle the new key scheme and resume sequencing. A new test case has been added to validate this new logic. My main feedback is to replace fmt.Printf calls with a structured logger for better observability in production.

Comment on lines 190 to 218

    for result := range results.Next() {
        if result.Error != nil {
            fmt.Printf("Error reading entry from datastore: %v\n", result.Error)
            continue
        }
        // We care about the last part of the key (the sequence number).
        // ds.Key usually has a leading slash.
        key := ds.NewKey(result.Key).Name()

        pbBatch := &pb.Batch{}
        err := proto.Unmarshal(result.Value, pbBatch)
        if err != nil {
            fmt.Printf("Error decoding batch for key '%s': %v. Skipping entry.\n", key, err)
            continue
        }

        batch := coresequencer.Batch{Transactions: pbBatch.Txs}
        bq.queue = append(bq.queue, queuedItem{Batch: batch, Key: key})

        // Update sequences based on loaded keys to avoid collisions
        if seq, err := strconv.ParseUint(key, 16, 64); err == nil {
            if seq >= bq.nextAddSeq {
                bq.nextAddSeq = seq + 1
            }
            if seq <= bq.nextPrependSeq {
                bq.nextPrependSeq = seq - 1
            }
        }
    }


Severity: medium

Throughout the Load and Next functions, errors are logged using fmt.Printf. In a library or production service, it's better to use a structured logging library (e.g., log/slog, zap, zerolog). This provides several benefits:

  • Controllable Output: Consumers of the package can direct logs to files, network services, etc.
  • Log Levels: You can assign severity levels (e.g., error, warn, info) to messages.
  • Structured Data: You can include key-value pairs (like the key in this case) for easier parsing and filtering by log analysis tools.

Consider injecting a logger into the BatchQueue to handle these cases.
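
A minimal sketch of that injection, assuming log/slog; the constructor signature is illustrative, not the package's actual API:

    package single

    import (
        "log/slog"

        ds "github.com/ipfs/go-datastore"
    )

    type BatchQueue struct {
        logger *slog.Logger
        // ... existing fields unchanged
    }

    func NewBatchQueue(db ds.Datastore, logger *slog.Logger) *BatchQueue {
        if logger == nil {
            logger = slog.Default() // safe fallback so call sites never need nil checks
        }
        return &BatchQueue{logger: logger}
    }

    // Call sites then replace fmt.Printf with leveled, structured records, e.g.:
    // bq.logger.Error("decoding batch failed", "key", key, "err", err)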


codecov bot commented Dec 16, 2025

Codecov Report

❌ Patch coverage is 83.60656% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.19%. Comparing base (742b8bb) to head (ff87318).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
sequencers/single/queue.go | 83.60% | 5 missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2938      +/-   ##
==========================================
+ Coverage   59.06%   59.19%   +0.13%     
==========================================
  Files          90       90              
  Lines        8596     8627      +31     
==========================================
+ Hits         5077     5107      +30     
- Misses       2937     2938       +1     
  Partials      582      582              
Flag | Coverage Δ
combined | 59.19% <83.60%> (+0.13%) ⬆️


@julienrbrt julienrbrt changed the title from "fix: deterministic queue" to "fix(sequencers/single): deterministic queue" Dec 16, 2025
@julienrbrt julienrbrt previously approved these changes Dec 18, 2025
@julienrbrt julienrbrt previously approved these changes Dec 19, 2025
@julienrbrt julienrbrt enabled auto-merge December 19, 2025 20:15
@julienrbrt julienrbrt disabled auto-merge December 19, 2025 20:31
@julienrbrt julienrbrt merged commit d4a2da9 into main Dec 19, 2025
19 of 20 checks passed
@julienrbrt julienrbrt deleted the alex/deterministic_queue branch December 19, 2025 20:31
alpe added a commit that referenced this pull request Dec 22, 2025
* main:
  fix: make evm_execution more robust (#2942)
  fix(sequencers/single): deterministic queue (#2938)
  fix(block): fix init logic sequencer for da epoch fetching (#2926)