CEP-45: Replication type migration #4508
base: cep-45-mutation-tracking
Conversation
Force-pushed c765a00 to 43b4b29
aweisberg left a comment:
Still reviewing; just noticed this.
aweisberg left a comment:
Finished review of the implementation code. Still need to finish reviewing the test code.
For range reads you 100% need to adapt InteropTokenRangeTest to test this splitting code. There are a bunch of nasty corner cases around range reads and what ends up being generated as the bounds of the read command, and that test covers most of them.
I also think that property-based tests would be better for the things that are unit testable, and maybe even dtests. It would be nice to create a partitioner that has a range of, say, -5 to +5 but is otherwise identical to a real partitioner and hashes integer values to themselves. Then property tests could enumerate the entire state space very quickly.
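A minimal, self-contained sketch of that idea (all names invented for illustration; this is not the real IPartitioner API): an 11-token space from -5 to +5, small enough that a test can exhaustively enumerate every combination of read-range bounds and migrated range and check an invariant of the splitting code.

```java
import java.util.List;

public class ToyMigrationEnumeration
{
    static final int MIN_TOKEN = -5, MAX_TOKEN = 5;

    /** Half-open token range (left, right], mirroring Cassandra's Range semantics. */
    static final class ToyRange
    {
        final int left, right;
        ToyRange(int left, int right) { this.left = left; this.right = right; }
        boolean contains(int token) { return token > left && token <= right; }
    }

    // Hypothetical stand-in for the splitting code under test: partition a read
    // range into the pieces that should be routed as tracked vs. untracked.
    static List<ToyRange> split(ToyRange read, ToyRange migrated)
    {
        return List.of(read); // replace with the real splitter
    }

    public static void main(String[] args)
    {
        // Every non-empty (read range, migrated range) pair; with 11 tokens this
        // is only a few thousand combinations and runs instantly.
        for (int rl = MIN_TOKEN; rl < MAX_TOKEN; rl++)
            for (int rr = rl + 1; rr <= MAX_TOKEN; rr++)
                for (int ml = MIN_TOKEN; ml < MAX_TOKEN; ml++)
                    for (int mr = ml + 1; mr <= MAX_TOKEN; mr++)
                    {
                        ToyRange read = new ToyRange(rl, rr);
                        ToyRange migrated = new ToyRange(ml, mr);
                        List<ToyRange> pieces = split(read, migrated);
                        // Invariant: the pieces cover exactly the tokens of the read range.
                        for (int t = MIN_TOKEN; t <= MAX_TOKEN; t++)
                        {
                            final int token = t;
                            boolean inRead = read.contains(token);
                            boolean inPieces = pieces.stream().anyMatch(p -> p.contains(token));
                            if (inRead != inPieces)
                                throw new AssertionError("split lost or invented token " + token);
                        }
                    }
        System.out.println("all combinations hold");
    }
}
```

With bounds this small, the same enumeration can be run as a plain unit test or fed into a property-testing harness and still cover the whole state space in milliseconds.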
aweisberg left a comment:
Also noticed that the check to see whether the repair is in an epoch that can affect the ranges seems to be missing in MutationTrackingMigrationState.
aweisberg left a comment:
I don't see a happens-before edge for repair guaranteeing that, when repair starts and snapshots tables, there are no in-flight non-tracked mutations that might not be included in the repair.
The way to ensure this would be for nodes to refuse to apply mutations they think should have been tracked, and for repair to then wait out in-flight writes after it has updated cluster metadata at each node as part of prepare.
WDYT?
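To make the proposal concrete, here is a toy, self-contained model of the ordering being suggested (all names invented; the real patch would hook into TCM and the repair prepare handler): replicas refuse untracked writes once their own metadata says the write should be tracked, and repair's prepare step first enacts the new epoch and then waits for writes started under older epochs to retire.

```java
import java.util.concurrent.atomic.AtomicLong;

class ToyReplica
{
    private volatile long currentEpoch = 1;                        // epoch this node has enacted
    private final AtomicLong inFlightUntracked = new AtomicLong(); // untracked writes still applying

    /** Untracked write path; refuses writes our metadata says should be tracked. */
    void applyUntracked(Runnable write)
    {
        if (trackedAsOf(currentEpoch))
            throw new IllegalStateException("refused: range is tracked, coordinator is behind");
        inFlightUntracked.incrementAndGet();
        try { write.run(); }
        finally { inFlightUntracked.decrementAndGet(); }
    }

    /** Repair prepare: enact the repair's epoch, then wait out straggling writes. */
    void prepareForRepair(long repairEpoch) throws InterruptedException
    {
        currentEpoch = Math.max(currentEpoch, repairEpoch); // the "wait for epoch" step
        while (inFlightUntracked.get() > 0)                 // drain in-flight untracked writes
            Thread.sleep(1);
        // Only now is it safe to snapshot: no untracked write can race the repair.
    }

    private boolean trackedAsOf(long epoch) { return epoch >= 2; } // toy migration rule
}
```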
I added a "wait for epoch" in the prepare phase handler for repair, which should serve that purpose. A repair starts, it waits for the epoch, and then no more untracked writes will be coordinated by it by the time it replies.
Can you go into more detail about what you'd expect to see for the interop token range test and the property tests? In the case of the interop token range tests, there's not really an analog for interop reads: reads are either tracked or untracked. Untracked reads are well tested, and tracked testing / hardening is a separate issue that's already being worked on. The interop tests don't check reads in partially migrated clusters, afaict.

In the case of property tests, we're mainly checking a handful of state changes driven by schema changes and repair here. We're not dealing with the sort of multiple intersecting state changes and parameter spaces that property tests are good at uncovering bugs in. I could see value in property-based simulator testing of reads, writes, and state changes, but I don't think that's possible without a lot of simulator work.
RE InteropTokenRangeTest and testing the splitting of range reads: InteropTokenRangeTest doesn't demonstrate different permutations of migration, but there are a small number of tokens, so you can enumerate and test all range migration combinations.
That tells you that the epoch has been received, but it doesn't tell you anything about in-flight writes that haven't completed yet. For that you would do the op order thing: have the node update to the epoch, then create an op order barrier to make sure all in-flight writes that don't use that epoch are retired. There is still the separate issue that this can cause writes to fail with the errors surfaced to the client, which is not ideal.

RE Hints
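For reference, a sketch of the op-order pattern mentioned above, using Cassandra's org.apache.cassandra.utils.concurrent.OpOrder (usage reconstructed from memory of how Keyspace's write path uses it, so treat the details as approximate, not as the actual patch):

```java
import org.apache.cassandra.utils.concurrent.OpOrder;

class EpochDrain
{
    private final OpOrder writeOrder = new OpOrder();

    /** Every write runs its critical section inside a group on the order. */
    void applyWrite(Runnable write)
    {
        try (OpOrder.Group group = writeOrder.start())
        {
            write.run(); // any group opened before a barrier counts as "prior" to it
        }
    }

    /** After enacting the new epoch: retire all writes begun under older epochs. */
    void awaitInFlightWrites()
    {
        OpOrder.Barrier barrier = writeOrder.newBarrier();
        barrier.issue(); // later start() calls are ordered after the barrier
        barrier.await(); // blocks until every prior group has closed
    }
}
```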
Force-pushed ba7727b to edd2fd8
aweisberg left a comment:
I think this technically works, but IllegalReplicationTypeException and retrying in the verb handler is overly complicated; throwing CoordinatorBehindException from inside the op order makes more sense. Migrations don't generally flip-flop so fast that, within the span of time it took the write to get from the verb handler into the keyspace and back to the verb handler, the replication type has switched again and we would be able to apply it and generate an acknowledgement.
catch (IllegalReplicationTypeException e)
{
    // retry, write may be recoverable if TCM is behind locally
    doVerb(message);
}
This is kind of confusing, because if we get this exception it means we got all the way into Keyspace to apply the mutation: we held an epoch under which the coordinator looked correct initially, but then we acquired an epoch where that was no longer the case. So it goes into doVerb again with the real goal being to throw CoordinatorBehindException?
I don't think the write would be recoverable in the sense that we will apply it here (since retrying doesn't switch the type of the mutation's replication), but it is recoverable in the sense that if we signal to the coordinator we aren't going to do it (via CoordinatorBehindException) then it will retry.
So maybe just update the comment to highlight that the goal is throwing CoordinatorBehindException so the coordinator retries?
Really I would just throw CoordinatorBehindException from inside the write order instead of introducing IllegalReplicationTypeException, because I don't think it happens in real life that we retry here and it succeeds. That would require a flip-flop of the migration direction, which seems really unlikely, so better to keep it simple and throw an error straight back to the coordinator.
Right, I'd added this to recover in cases where we're behind locally. Here's a possible (though unlikely) scenario.

You have a log with 3 changes:
Epoch 1 - tracked
Epoch 2 - untracked
Epoch 3 - tracked

Coordinator sends a tracked mutation with epoch 3. The replica is on epoch 1, but the migration router agrees, so the verb handler check passes. Between the check and the apply, the local TCM advances to epoch 2, and the check in Keyspace#apply fails. However, we're still behind, and the verb handler retries, notices we should be on epoch 3, advances the local log to epoch 3, and the write succeeds.

I could see this happening if an operator updates the replication type and quickly changes it back while TCM replication is having trouble somewhere. What's less clear is whether not dropping writes in this edge case is worth the complexity, and which option is more likely to cause wasted effort in production investigations. WDYT?
It's really unusual, verging on impossible, to hit this scenario, just because of the time scales involved. This all happens while a single thread checks and sees epoch 1, immediately enters Keyspace's write op order and checks and sees epoch 2, and then unwinds the stack to the verb handler. The entire thing takes single-digit milliseconds. It's just not worth the code to be able to respond to the coordinator slightly faster.
I think it's really better to just throw CoordinatorBehindException and let the coordinator retry.
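A toy, self-contained sketch of that simplification (names invented apart from CoordinatorBehindException, which is the exception discussed above; the real code would do this re-check inside Keyspace's write op order):

```java
class CoordinatorBehindException extends RuntimeException
{
    CoordinatorBehindException(String msg) { super(msg); }
}

class ToyVerbHandler
{
    enum ReplicationType { TRACKED, UNTRACKED }

    private volatile ReplicationType localView = ReplicationType.UNTRACKED;

    void doVerb(ReplicationType coordinatorView, Runnable apply)
    {
        // Re-check against the epoch we hold *now*, inside what would be the
        // write op order. On mismatch, fail straight back to the coordinator,
        // which refreshes its metadata and retries; no local doVerb() re-entry
        // and no separate IllegalReplicationTypeException needed.
        if (coordinatorView != localView)
            throw new CoordinatorBehindException("replication type changed; coordinator epoch is stale");
        apply.run();
    }
}
```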
aweisberg left a comment:
+1 TY!
Force-pushed 939db95 to 7545748