CEP-45: Replication type migration #4508
base: cep-45-mutation-tracking
Conversation
Force-pushed c765a00 to 43b4b29
aweisberg left a comment:
Still reviewing; just noticed this.
aweisberg left a comment:
Finished review of the implementation code. Still need to finish reviewing the test code.
For range reads you 100% need to adapt InteropTokenRangeTest to test this splitting code. There are a bunch of nasty corner cases around range reads and what ends up being generated as the bounds of the read command, and that test covers most of them.
I also think that property-based tests would be better for the things that are unit testable, and maybe even dtests. It would be nice to create a partitioner that has a range of, say, -5 to +5 but is otherwise identical to a real partitioner and hashes integer values to themselves. Then property tests could enumerate the entire state space very quickly.
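A minimal, self-contained sketch of that idea (all names invented for illustration; this is not the real IPartitioner API): an 11-token space from -5 to +5, small enough that a test can exhaustively enumerate every combination of read-range bounds and migrated range and check an invariant of the splitting code.

```java
import java.util.List;

public class ToyMigrationEnumeration
{
    static final int MIN_TOKEN = -5, MAX_TOKEN = 5;

    /** Half-open token range (left, right], mirroring Cassandra's Range semantics. */
    static final class ToyRange
    {
        final int left, right;
        ToyRange(int left, int right) { this.left = left; this.right = right; }
        boolean contains(int token) { return token > left && token <= right; }
    }

    // Hypothetical stand-in for the splitting code under test: partition a read
    // range into the pieces that should be routed as tracked vs. untracked.
    static List<ToyRange> split(ToyRange read, ToyRange migrated)
    {
        return List.of(read); // replace with the real splitter
    }

    public static void main(String[] args)
    {
        // Every non-empty (read range, migrated range) pair; with 11 tokens this
        // is only a few thousand combinations and runs instantly.
        for (int rl = MIN_TOKEN; rl < MAX_TOKEN; rl++)
            for (int rr = rl + 1; rr <= MAX_TOKEN; rr++)
                for (int ml = MIN_TOKEN; ml < MAX_TOKEN; ml++)
                    for (int mr = ml + 1; mr <= MAX_TOKEN; mr++)
                    {
                        ToyRange read = new ToyRange(rl, rr);
                        ToyRange migrated = new ToyRange(ml, mr);
                        List<ToyRange> pieces = split(read, migrated);
                        // Invariant: the pieces cover exactly the tokens of the read range.
                        for (int t = MIN_TOKEN; t <= MAX_TOKEN; t++)
                        {
                            final int token = t;
                            boolean inRead = read.contains(token);
                            boolean inPieces = pieces.stream().anyMatch(p -> p.contains(token));
                            if (inRead != inPieces)
                                throw new AssertionError("split lost or invented token " + token);
                        }
                    }
        System.out.println("all combinations hold");
    }
}
```

With bounds this small, the same enumeration can be run as a plain unit test or fed into a property-testing harness and still cover the whole state space in milliseconds.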
aweisberg left a comment:
Also noticed that the check to see whether the repair is in an epoch that can affect the ranges seems to be missing in MutationTrackingMigrationState.
aweisberg left a comment:
I don't see a happens-before edge for repair guaranteeing that, when repair starts and snapshots tables, there are no in-flight non-tracked mutations that might not be included in the repair.
The way to ensure this would be for nodes to refuse to apply mutations they think should have been tracked, and for repair to then wait out in-flight writes after it has updated cluster metadata at each node as part of prepare.
WDYT?
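To make the proposal concrete, here is a toy, self-contained model of the ordering being suggested (all names invented; the real patch would hook into TCM and the repair prepare handler): replicas refuse untracked writes once their own metadata says the write should be tracked, and repair's prepare step first enacts the new epoch and then waits for writes started under older epochs to retire.

```java
import java.util.concurrent.atomic.AtomicLong;

class ToyReplica
{
    private volatile long currentEpoch = 1;                        // epoch this node has enacted
    private final AtomicLong inFlightUntracked = new AtomicLong(); // untracked writes still applying

    /** Untracked write path; refuses writes our metadata says should be tracked. */
    void applyUntracked(Runnable write)
    {
        if (trackedAsOf(currentEpoch))
            throw new IllegalStateException("refused: range is tracked, coordinator is behind");
        inFlightUntracked.incrementAndGet();
        try { write.run(); }
        finally { inFlightUntracked.decrementAndGet(); }
    }

    /** Repair prepare: enact the repair's epoch, then wait out straggling writes. */
    void prepareForRepair(long repairEpoch) throws InterruptedException
    {
        currentEpoch = Math.max(currentEpoch, repairEpoch); // the "wait for epoch" step
        while (inFlightUntracked.get() > 0)                 // drain in-flight untracked writes
            Thread.sleep(1);
        // Only now is it safe to snapshot: no untracked write can race the repair.
    }

    private boolean trackedAsOf(long epoch) { return epoch >= 2; } // toy migration rule
}
```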
I added a "wait for epoch" in the prepare phase handler for repair, which should serve that purpose. A repair starts, it waits for the epoch, and then no more untracked writes will be coordinated by it by the time it replies.
Can you go into more detail about what you'd expect to see for the interop token range test and the property tests? In the case of the interop token range tests, there's not really an analog for interop reads: reads are either tracked or untracked. Untracked reads are well tested, and tracked testing / hardening is a separate issue that's already being worked on. The interop tests don't check reads in partially migrated clusters, afaict.

In the case of property tests, we're mainly checking a handful of state changes driven by schema changes and repair here. We're not dealing with the sort of multiple intersecting state changes and parameter spaces that property tests are good at uncovering bugs in. I could see value in property-based simulator testing of reads, writes, and state changes, but I don't think that's possible without a lot of simulator work.
RE InteropTokenRangeTest and testing the splitting of range reads: InteropTokenRangeTest doesn't demonstrate different permutations of migration, but there are a small number of tokens, so you can enumerate and test all range migration combinations.
That tells you that the epoch has been received, but it doesn't tell you anything about in-flight writes that haven't completed yet. For that you would do the op order thing: have the node update to the epoch, then create an op order barrier to make sure all in-flight writes that don't use that epoch are retired. There is still the separate issue that this can cause writes to fail with the errors surfaced to the client, which is not ideal.

RE Hints
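For reference, a sketch of the op-order pattern mentioned above, using Cassandra's org.apache.cassandra.utils.concurrent.OpOrder (usage reconstructed from memory of how Keyspace's write path uses it, so treat the details as approximate, not as the actual patch):

```java
import org.apache.cassandra.utils.concurrent.OpOrder;

class EpochDrain
{
    private final OpOrder writeOrder = new OpOrder();

    /** Every write runs its critical section inside a group on the order. */
    void applyWrite(Runnable write)
    {
        try (OpOrder.Group group = writeOrder.start())
        {
            write.run(); // any group opened before a barrier counts as "prior" to it
        }
    }

    /** After enacting the new epoch: retire all writes begun under older epochs. */
    void awaitInFlightWrites()
    {
        OpOrder.Barrier barrier = writeOrder.newBarrier();
        barrier.issue(); // later start() calls are ordered after the barrier
        barrier.await(); // blocks until every prior group has closed
    }
}
```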
Force-pushed ba7727b to edd2fd8
aweisberg left a comment:
I think this technically works, but IllegalReplicationTypeException and retrying in the verb handler is overly complicated; throwing CoordinatorBehindException from inside the op order makes more sense. Migrations don't generally flip-flop so fast that, within the span of time it took the write to get from the verb handler into the keyspace and back to the verb handler, the replication type has switched again and we would be able to apply it and generate an acknowledgement.
catch (IllegalReplicationTypeException e)
{
    // retry, write may be recoverable if TCM is behind locally
    doVerb(message);
}
This is kind of confusing, because if we get this exception it means we got all the way into Keyspace to apply the mutation: we held an epoch under which the coordinator looked correct initially, but then we acquired an epoch where that was no longer the case. So it goes into doVerb again with the real goal being to throw CoordinatorBehindException?
I don't think the write would be recoverable in the sense that we will apply it here (since retrying doesn't switch the type of the mutation's replication), but it is recoverable in the sense that if we signal to the coordinator we aren't going to do it (via CoordinatorBehindException) then it will retry.
So maybe just update the comment to highlight that the goal is throwing CoordinatorBehindException so the coordinator retries?
Really I would just throw CoordinatorBehindException from inside the write order instead of introducing IllegalReplicationTypeException, because I don't think it happens in real life that we retry here and it succeeds. That would require a flip-flop of the migration direction, which seems really unlikely, so better to keep it simple and throw an error straight back to the coordinator.
Right, I'd added this to recover in cases where we're behind locally. Here's a possible (though unlikely) scenario.

You have a log with 3 changes:
Epoch 1 - tracked
Epoch 2 - untracked
Epoch 3 - tracked

Coordinator sends a tracked mutation with epoch 3. The replica is on epoch 1, but the migration router agrees, so the verb handler check passes. Between the check and the apply, the local TCM advances to epoch 2, and the check in Keyspace#apply fails. However, we're still behind, and the verb handler retries, notices we should be on epoch 3, advances the local log to epoch 3, and the write succeeds.

I could see this happening if an operator updates the replication type and quickly changes it back while TCM replication is having trouble somewhere. What's less clear is whether not dropping writes in this edge case is worth the complexity, and which option is more likely to cause wasted effort in production investigations. WDYT?
It's really unusual, verging on impossible, to hit this scenario, just because of the time scales involved. This all happens while a single thread checks and sees epoch 1, immediately enters Keyspace's write op order and checks and sees epoch 2, and then unwinds the stack to the verb handler. The entire thing takes single-digit milliseconds. It's just not worth the code to be able to respond to the coordinator slightly faster.
I think it's really better to just throw CoordinatorBehindException and let the coordinator retry.
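A toy, self-contained sketch of that simplification (names invented apart from CoordinatorBehindException, which is the exception discussed above; the real code would do this re-check inside Keyspace's write op order):

```java
class CoordinatorBehindException extends RuntimeException
{
    CoordinatorBehindException(String msg) { super(msg); }
}

class ToyVerbHandler
{
    enum ReplicationType { TRACKED, UNTRACKED }

    private volatile ReplicationType localView = ReplicationType.UNTRACKED;

    void doVerb(ReplicationType coordinatorView, Runnable apply)
    {
        // Re-check against the epoch we hold *now*, inside what would be the
        // write op order. On mismatch, fail straight back to the coordinator,
        // which refreshes its metadata and retries; no local doVerb() re-entry
        // and no separate IllegalReplicationTypeException needed.
        if (coordinatorView != localView)
            throw new CoordinatorBehindException("replication type changed; coordinator epoch is stale");
        apply.run();
    }
}
```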
aweisberg left a comment:
+1 TY!
Force-pushed 939db95 to 7545748