Skip to content

Conversation

@smiklosovic
Copy link
Contributor

Thanks for sending a pull request! Here are some tips if you're new here:

  • Ensure you have added or run the appropriate tests for your PR.
  • Be sure to keep the PR description updated to reflect all changes.
  • Write your PR title to summarize what this PR proposes.
  • If possible, provide a concise example to reproduce the issue for a faster review.
  • Read our contributor guidelines
  • If you're making a documentation change, see our guide to documentation contribution

Commit messages should follow the following format:

<One sentence description, usually Jira title or CHANGES.txt summary>

<Optional lengthier description (context on patch)>

patch by <Authors>; reviewed by <Reviewers> for CASSANDRA-#####

Co-authored-by: Name1 <email1>
Co-authored-by: Name2 <email2>

The Cassandra Jira

@smiklosovic smiklosovic requested a review from yifan-c December 15, 2025 17:27
@smiklosovic smiklosovic force-pushed the CASSANDRA-21078 branch 2 times, most recently from 4d0a88f to 49c5bc6 Compare December 15, 2025 17:34
Copy link
Contributor

@yifan-c yifan-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good overall. As discussed offline, it makes better sense to have those parameters as part of compression table attribute.

@smiklosovic smiklosovic force-pushed the CASSANDRA-21078 branch 2 times, most recently from 69abf82 to a353297 Compare December 16, 2025 14:52
@smiklosovic smiklosovic requested a review from yifan-c December 16, 2025 14:55
@smiklosovic smiklosovic force-pushed the CASSANDRA-21078 branch 5 times, most recently from 42d165d to 6748c80 Compare December 18, 2025 12:58
Copy link
Contributor

@jyothsnakonisa jyothsnakonisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall a few minor comments around inconsistencies in documentation and naming around sampling rate parameter

working_directory: ~/
shell: /bin/bash -eo pipefail -l
parallelism: 4
parallelism: 25
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you checkin these changes in the file by accident?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a common practice to check-in the Circle config to run CI with more resources. The particular commit (typically named "do not commit") will be dropped when committing the patch.

}

@VisibleForTesting
public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingRate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are passing 1/samplingRate here, should the name of the variable be percentage? or any other appropriate name

Suggested change
public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingRate)
public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingPercentage)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep it as it is.

Previously, if you had a sample rate of 0.01, then
Math.round(1 / 0.01) = 100 and shouldSample method was doing

ThreadLocalRandom.current().nextInt(samplingRate) == 0

which picked a number from 0 (inclusive) to 100 (exclusive).

However, if we set sample rate to, for example, 0.75 to say that
75% should be sampled, then Math.round(1 / 0.75) = 1.33 and rounded it is 1.

ThreadLocalRandom.current().nextInt(1) == 0

will be always true. That means what basically from some sample rate which rounds to 1
we lose the probability.

The current approach works with floats and it is rewritten to

ThreadLocalRandom.current().nextFloat() < samplingRate

nextFloat() gives values between zero (inclusive) and one (exclusive).
If we set sampling rate to 1 then it will be always true. If we set it to 0.01 that will be 1%.
If we set it to 0.75 that will be 75%, without losing any accuracy.
Copy link
Contributor

@jyothsnakonisa jyothsnakonisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good mostly, left few minor comments about threadsafety.

private final String keyspaceName;
private final String tableName;
private final CompressionDictionaryTrainingConfig config;
private volatile CompressionDictionaryTrainingConfig config;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since config can be reset to null, can you add a null check before using it? I think you should add the check in buildNotReadyMessage() method and please check if it should be added anywhere else.

public boolean shouldSample()
{
return zstdTrainer != null && ThreadLocalRandom.current().nextInt(samplingRate) == 0;
return zstdTrainer != null && ThreadLocalRandom.current().nextFloat() < samplingRate;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make zstdTrainer volatile? It is updated using synchronized but while reading there is no synchronized. We can still read stale value even after synchronized update.

sampleCount.set(0);
zstdTrainer = new ZstdDictTrainer(config.maxTotalSampleSize, config.maxDictionarySize, compressionLevel);
zstdTrainer = new ZstdDictTrainer(trainingConfig.maxTotalSampleSize, trainingConfig.maxDictionarySize, compressionLevel);
config = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is trainingConfig that is passed as parameter not being assigned to config here? when config is set to null here and then set trainingConfig after reset is called, there will be a small window where the config is null before setting to trainingConfig, don't you think it is better to set the config here instead of setting it to null and assigning trainingConfig later?

Copy link
Contributor

@jyothsnakonisa jyothsnakonisa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants