CASSANDRA-21078 move training params to CQL #4523

smiklosovic · 2025-12-15T17:27:26Z

src/java/org/apache/cassandra/db/compression/CompressionDictionaryManagerMBean.java

src/java/org/apache/cassandra/io/compress/IDictionaryCompressor.java

src/java/org/apache/cassandra/io/compress/ZstdDictionaryCompressor.java

src/java/org/apache/cassandra/tools/NodeProbe.java

yifan-c

Looks good overall. As discussed offline, it makes more sense to have those parameters as part of the table's compression attributes.

src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java

doc/modules/cassandra/pages/managing/operating/compression.adoc

jyothsnakonisa

Looks good overall. I left a few minor comments about inconsistencies in the documentation and about the naming of the sampling rate parameter.

jyothsnakonisa · 2025-12-19T19:16:45Z

.circleci/config.yml

    working_directory: ~/
    shell: /bin/bash -eo pipefail -l
-    parallelism: 4
+    parallelism: 25


Did you check in the changes to this file by accident?

It is common practice to check in the Circle config so that CI runs with more resources. That particular commit (typically named "do not commit") will be dropped when the patch is committed.

jyothsnakonisa · 2025-12-19T19:32:06Z

src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java

+    }
+
+    @VisibleForTesting
+    public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingRate)


You are passing 1/samplingRate here. Should the variable be named percentage, or have some other more appropriate name?

Suggested change

public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingRate)

public ZstdDictionaryTrainer(String keyspaceName, String tableName, int compressionLevel, int samplingPercentage)

I would keep it as it is.

doc/modules/cassandra/pages/managing/operating/compression.adoc

src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java

Previously, with a sample rate of 0.01, Math.round(1 / 0.01) = 100, and the shouldSample method evaluated ThreadLocalRandom.current().nextInt(samplingRate) == 0, which draws a number from 0 (inclusive) to 100 (exclusive), so about 1 call in 100 returned true. However, if we set the sample rate to, for example, 0.75 to say that 75% should be sampled, then 1 / 0.75 = 1.33, which Math.round turns into 1, and ThreadLocalRandom.current().nextInt(1) == 0 is always true. In other words, any sample rate whose reciprocal rounds to 1 loses its probability and samples everything.

The current approach works with floats and is rewritten to ThreadLocalRandom.current().nextFloat() < samplingRate. nextFloat() gives values between zero (inclusive) and one (exclusive). If we set the sampling rate to 1, the check is always true. If we set it to 0.01, that is 1%. If we set it to 0.75, that is 75%, without losing any accuracy.
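The difference between the two approaches can be sketched in a few lines. This is a minimal, self-contained illustration and not the code from the patch: the class and method names (`SamplingRateDemo`, `reciprocalBound`, `shouldSampleOld`, `shouldSampleNew`, `observedRate`) are hypothetical, and only the `Math.round`, `nextInt` and `nextFloat` expressions come from the comment above.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Hypothetical demo class; the names here are illustrative and not taken from the patch.
class SamplingRateDemo
{
    // Old approach, step 1: turn the rate into an integer bound by rounding its reciprocal.
    static int reciprocalBound(double samplingRate)
    {
        return (int) Math.round(1 / samplingRate);
    }

    // Old approach, step 2: sample when a draw from [0, bound) hits zero.
    static boolean shouldSampleOld(int bound)
    {
        return ThreadLocalRandom.current().nextInt(bound) == 0;
    }

    // New approach: compare a uniform float from [0, 1) with the rate directly.
    static boolean shouldSampleNew(float samplingRate)
    {
        return ThreadLocalRandom.current().nextFloat() < samplingRate;
    }

    // Fraction of trials in which the given sampler returned true.
    static double observedRate(BooleanSupplier sampler, int trials)
    {
        int hits = 0;
        for (int i = 0; i < trials; i++)
            if (sampler.getAsBoolean())
                hits++;
        return (double) hits / trials;
    }

    public static void main(String[] args)
    {
        int bound = reciprocalBound(0.75); // 1 / 0.75 = 1.33..., which rounds to 1
        System.out.println("bound for 0.75: " + bound);
        // nextInt(1) can only return 0, so the old check fires on every call.
        System.out.println("old observed rate: " + observedRate(() -> shouldSampleOld(bound), 100_000));
        // The float comparison should land close to 0.75.
        System.out.println("new observed rate: " + observedRate(() -> shouldSampleNew(0.75f), 100_000));
    }
}
```

With the integer bound, every rate whose reciprocal rounds to 1 behaves like a rate of 1, while the float comparison keeps the configured probability.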

src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java

doc/modules/cassandra/pages/managing/operating/compression.adoc

jyothsnakonisa

Looks mostly good. I left a few minor comments about thread safety.

jyothsnakonisa · 2025-12-22T07:23:49Z

src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java

    private final String keyspaceName;
    private final String tableName;
-    private final CompressionDictionaryTrainingConfig config;
+    private volatile CompressionDictionaryTrainingConfig config;


Since config can be reset to null, can you add a null check before using it? I think the check should go in the buildNotReadyMessage() method; please also check whether it is needed anywhere else.
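One common way to write such a check is to read the volatile field once into a local variable and test that local, so that a concurrent reset cannot null the field between the check and the use. The sketch below only illustrates that pattern: `TrainingConfig` is a hypothetical stand-in for CompressionDictionaryTrainingConfig, and the signature and message text of `buildNotReadyMessage` are assumptions, not the ones in the patch.

```java
// Hypothetical demo class; only the field name "config" and the method name come from the review.
class NotReadyMessageDemo
{
    // Stand-in for CompressionDictionaryTrainingConfig, reduced to one field.
    static final class TrainingConfig
    {
        final int maxTotalSampleSize;

        TrainingConfig(int maxTotalSampleSize)
        {
            this.maxTotalSampleSize = maxTotalSampleSize;
        }
    }

    private volatile TrainingConfig config;

    void setConfig(TrainingConfig newConfig)
    {
        config = newConfig;
    }

    // Assumed signature and wording; the real method may look different.
    String buildNotReadyMessage(long sampledBytes)
    {
        TrainingConfig snapshot = config; // single volatile read, then only the local is used
        if (snapshot == null)
            return "Trainer is not configured";
        return "Collected " + sampledBytes + " of " + snapshot.maxTotalSampleSize + " bytes";
    }
}
```

Checking `config != null` and then dereferencing `config` again would be two separate reads, and the second one could still observe null.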

jyothsnakonisa · 2025-12-22T07:28:47Z

src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java

    public boolean shouldSample()
    {
-        return zstdTrainer != null && ThreadLocalRandom.current().nextInt(samplingRate) == 0;
+        return zstdTrainer != null && ThreadLocalRandom.current().nextFloat() < samplingRate;


Can you make zstdTrainer volatile? It is updated inside a synchronized block, but it is read without synchronization, so a reader can still see a stale value even after the synchronized update.
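The visibility concern can be sketched as follows. Under the Java memory model, a write made while holding a monitor is only guaranteed to be visible to a thread that later acquires the same monitor; declaring the field volatile extends that guarantee to lock-free readers. The class below is a simplified illustration: `Trainer` is a hypothetical stand-in for the zstd trainer object, and `reset` and `close` are assumed method names.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical demo class; not the ZstdDictionaryTrainer from the patch.
class VolatileTrainerDemo
{
    // Stand-in for the native trainer object.
    static final class Trainer {}

    // volatile: the unsynchronized read in shouldSample() sees the latest write
    // made by reset() or close(), even though those run under the monitor.
    private volatile Trainer trainer;
    private final float samplingRate;

    VolatileTrainerDemo(float samplingRate)
    {
        this.samplingRate = samplingRate;
    }

    synchronized void reset()
    {
        trainer = new Trainer();
    }

    synchronized void close()
    {
        trainer = null;
    }

    // Hot path: no lock is taken here, only a volatile read.
    boolean shouldSample()
    {
        return trainer != null && ThreadLocalRandom.current().nextFloat() < samplingRate;
    }
}
```

Writers still serialize on the monitor, while readers pay only for a volatile read.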

jyothsnakonisa · 2025-12-22T07:34:02Z

src/java/org/apache/cassandra/db/compression/ZstdDictionaryTrainer.java

            sampleCount.set(0);
-            zstdTrainer = new ZstdDictTrainer(config.maxTotalSampleSize, config.maxDictionarySize, compressionLevel);
+            zstdTrainer = new ZstdDictTrainer(trainingConfig.maxTotalSampleSize, trainingConfig.maxDictionarySize, compressionLevel);
+            config = null;


Why is the trainingConfig passed as a parameter not assigned to config here? If config is set to null here and trainingConfig is only assigned after reset is called, there is a small window in which config is null. Wouldn't it be better to assign trainingConfig to config here, instead of setting it to null and assigning it later?
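A minimal sketch of the suggested shape, assuming reset receives the new configuration as a parameter: the trainer and the config are both replaced inside the same synchronized method, so config goes straight from the previous value to the new one without passing through null. `TrainingConfig` and `Trainer` are hypothetical stand-ins, and the real reset method may do more than this.

```java
// Hypothetical demo class; field and parameter names follow the review comment.
class ResetConfigDemo
{
    // Stand-in for CompressionDictionaryTrainingConfig.
    static final class TrainingConfig
    {
        final int maxTotalSampleSize;
        final int maxDictionarySize;

        TrainingConfig(int maxTotalSampleSize, int maxDictionarySize)
        {
            this.maxTotalSampleSize = maxTotalSampleSize;
            this.maxDictionarySize = maxDictionarySize;
        }
    }

    // Stand-in for the native trainer; it only records what it was built with.
    static final class Trainer
    {
        final int maxTotalSampleSize;
        final int maxDictionarySize;

        Trainer(int maxTotalSampleSize, int maxDictionarySize)
        {
            this.maxTotalSampleSize = maxTotalSampleSize;
            this.maxDictionarySize = maxDictionarySize;
        }
    }

    private volatile TrainingConfig config;
    private volatile Trainer trainer;

    synchronized void reset(TrainingConfig trainingConfig)
    {
        trainer = new Trainer(trainingConfig.maxTotalSampleSize, trainingConfig.maxDictionarySize);
        config = trainingConfig; // assigned here instead of being nulled and set later
    }

    TrainingConfig config()
    {
        return config;
    }

    Trainer trainer()
    {
        return trainer;
    }
}
```

A lock-free reader can still briefly see the new trainer together with the previous config between the two writes; holding both in one immutable object behind a single volatile reference would close that gap as well.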

jyothsnakonisa

Looks good to me!

smiklosovic requested a review from yifan-c December 15, 2025 17:27

smiklosovic force-pushed the CASSANDRA-21078 branch 2 times, most recently from 4d0a88f to 49c5bc6 Compare December 15, 2025 17:34

smiklosovic commented Dec 15, 2025

src/java/org/apache/cassandra/db/compression/CompressionDictionaryManagerMBean.java (outdated)

smiklosovic commented Dec 15, 2025

src/java/org/apache/cassandra/io/compress/IDictionaryCompressor.java

smiklosovic commented Dec 15, 2025

src/java/org/apache/cassandra/io/compress/ZstdDictionaryCompressor.java (outdated)

smiklosovic commented Dec 15, 2025

src/java/org/apache/cassandra/tools/NodeProbe.java

yifan-c reviewed Dec 16, 2025

src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java

doc/modules/cassandra/pages/managing/operating/compression.adoc (outdated)

smiklosovic force-pushed the CASSANDRA-21078 branch 2 times, most recently from 69abf82 to a353297 Compare December 16, 2025 14:52

smiklosovic requested a review from yifan-c December 16, 2025 14:55

smiklosovic force-pushed the CASSANDRA-21078 branch 5 times, most recently from 42d165d to 6748c80 Compare December 18, 2025 12:58

jyothsnakonisa reviewed Dec 19, 2025

smiklosovic added 3 commits December 20, 2025 23:16

CASSANDRA-21078 move training params to CQL

2cfa604

make trainer to be configured per each start

41dd079

fixes

6345c36

smiklosovic force-pushed the CASSANDRA-21078 branch from 6748c80 to 6345c36 Compare December 20, 2025 23:26

yifan-c reviewed Dec 22, 2025

src/java/org/apache/cassandra/tools/nodetool/CompressionDictionaryCommandGroup.java (outdated)

doc/modules/cassandra/pages/managing/operating/compression.adoc (outdated)

jyothsnakonisa reviewed Dec 22, 2025

smiklosovic and others added 2 commits December 22, 2025 09:23

Update doc/modules/cassandra/pages/managing/operating/compression.adoc

c305e89

Co-authored-by: Yifan Cai <ycai@apache.org>

fixes

80c34f1

jyothsnakonisa approved these changes Dec 22, 2025

yifan-c approved these changes Dec 23, 2025
