-
Notifications
You must be signed in to change notification settings - Fork 3.8k
CASSANDRA-21078 move training params to CQL #4523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
4d0a88f to
49c5bc6
Compare
src/java/org/apache/cassandra/db/compression/CompressionDictionaryManagerMBean.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/io/compress/ZstdDictionaryCompressor.java
Outdated
Show resolved
Hide resolved
yifan-c
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good overall. As discussed offline, it makes better sense to have those parameters as part of compression table attribute.
src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java
Show resolved
Hide resolved
doc/modules/cassandra/pages/managing/operating/compression.adoc
Outdated
Show resolved
Hide resolved
69abf82 to
a353297
Compare
42d165d to
6748c80
Compare
jyothsnakonisa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall a few minor comments around inconsistencies in documentation and naming around sampling rate parameter
.circleci/config.yml
Outdated
| working_directory: ~/ | ||
| shell: /bin/bash -eo pipefail -l | ||
| parallelism: 4 | ||
| parallelism: 25 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you checkin these changes in the file by accident?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a common practice to check-in the Circle config to run CI with more resources. The particular commit (typically named "do not commit") will be dropped when committing the patch.
| } | ||
|
|
||
| @VisibleForTesting | ||
| public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingRate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are passing 1/samplingRate here, should the name of the variable be percentage? or any other appropriate name
| public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingRate) | |
| public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingPercentage) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would keep it as it is.
src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java
Outdated
Show resolved
Hide resolved
6748c80 to
6345c36
Compare
Previously, if you had a sample rate of 0.01, then Math.round(1 / 0.01) = 100 and shouldSample method was doing ThreadLocalRandom.current().nextInt(samplingRate) == 0 which picked a number from 0 (inclusive) to 100 (exclusive). However, if we set sample rate to, for example, 0.75 to say that 75% should be sampled, then Math.round(1 / 0.75) = 1.33 and rounded it is 1. ThreadLocalRandom.current().nextInt(1) == 0 will be always true. That means what basically from some sample rate which rounds to 1 we lose the probability. The current approach works with floats and it is rewritten to ThreadLocalRandom.current().nextFloat() < samplingRate nextFloat() gives values between zero (inclusive) and one (exclusive). If we set sampling rate to 1 then it will be always true. If we set it to 0.01 that will be 1%. If we set it to 0.75 that will be 75%, without losing any accuracy.
src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java
Outdated
Show resolved
Hide resolved
doc/modules/cassandra/pages/managing/operating/compression.adoc
Outdated
Show resolved
Hide resolved
jyothsnakonisa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good mostly, left few minor comments about threadsafety.
| private final String keyspaceName; | ||
| private final String tableName; | ||
| private final CompressionDictionaryTrainingConfig config; | ||
| private volatile CompressionDictionaryTrainingConfig config; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since config can be reset to null, can you add a null check before using it? I think you should add the check in buildNotReadyMessage() method and please check if it should be added anywhere else.
| public boolean shouldSample() | ||
| { | ||
| return zstdTrainer != null && ThreadLocalRandom.current().nextInt(samplingRate) == 0; | ||
| return zstdTrainer != null && ThreadLocalRandom.current().nextFloat() < samplingRate; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make zstdTrainer volatile? It is updated using synchronized but while reading there is no synchronized. We can still read stale value even after synchronized update.
| sampleCount.set(0); | ||
| zstdTrainer = new ZstdDictTrainer(config.maxTotalSampleSize, config.maxDictionarySize, compressionLevel); | ||
| zstdTrainer = new ZstdDictTrainer(trainingConfig.maxTotalSampleSize, trainingConfig.maxDictionarySize, compressionLevel); | ||
| config = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is trainingConfig that is passed as parameter not being assigned to config here? when config is set to null here and then set trainingConfig after reset is called, there will be a small window where the config is null before setting to trainingConfig, don't you think it is better to set the config here instead of setting it to null and assigning trainingConfig later?
Co-authored-by: Yifan Cai <ycai@apache.org>
jyothsnakonisa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
Thanks for sending a pull request! Here are some tips if you're new here:
Commit messages should follow the following format:
The Cassandra Jira