-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Description
Motivation and Problem Description
In StarRocks, the query optimizer determines whether a Common Table Expression (CTE) should be inlined (not materialized) or materialized based on optimizer costs, and involving the number of times the CTE is referenced. Inlining is often the default behavior to enable aggressive optimizations like Predicate Pushdown if CTE is only used once.
However, this automatic decision process can lead to two main issues where user intervention is necessary:
- Semantic Errors with Non-Deterministic Functions: When a CTE or the query referencing it contains non-deterministic functions (e.g.,
RAND(),NOW()), inlining and subsequent query rewrite (like Predicate Pushdown) can cause the function to be re-evaluated or pushed down inappropriately. If the CTE is referenced multiple times, or if the logic requires sampling a fixed intermediate result, the re-evaluation of the non-deterministic function violates the query's intended semantics and leads to incorrect results (e.g., inconsistent sampling, or zero rows returned). - Suboptimal Performance: In complex scenarios, the optimizer's heuristic might incorrectly decide to inline a large CTE that is referenced multiple times. Materializing the CTE once could save significant re-computation time, even if the CTE doesn't contain non-deterministic functions. Conversely, the optimizer might incorrectly choose to materialize a small, simple CTE, blocking beneficial pushdowns.
To provide users with control over the execution plan for correctness and performance, StarRocks should support the standard SQL syntax extensions MATERIALIZED and NOT MATERIALIZED within the WITH clause.
RAND() will be pushed down into table scans:
WITH oversampled AS (
SELECT *
FROM test_view1
WHERE RAND() < 0.0010516300994269146
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;
Existing Implementations
This syntax is common practice in the industry:
- PostgreSQL: https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-CTE-MATERIALIZATION
- DuckDB: https://duckdb.org/docs/stable/sql/query_syntax/with
Proposed Solution
Implement the standard SQL syntax extensions to explicitly instruct the optimizer on CTE behavior:
| Keyword | Action | Primary Use Case |
|---|---|---|
MATERIALIZED |
Forces the CTE result to be computed once and stored temporarily (materialized). Subsequent references read from the stored result. | Guaranteeing semantic correctness for queries involving non-deterministic functions (like RAND()), or preventing expensive re-computation of complex subqueries referenced multiple times. |
NOT MATERIALIZED |
Forces the CTE to be inlined into the main query, regardless of reference count or optimizer heuristics. | Overriding the optimizer when it incorrectly decides to materialize a CTE that would benefit from predicate pushdown and other cross-query optimizations. |
Example Case Demonstrating the Problem and Solution
The following example shows how the lack of a MATERIALIZED keyword leads to incorrect random sampling results, where the RAND() predicate is pushed too deep due to inlining.
Initial Setup (Tables and View):
CREATE TABLE `t1` (
`id` int(11) NULL COMMENT "id",
`pt` datetime NOT NULL COMMENT "",
`gmv` int(11) NULL COMMENT "gmv"
) ENGINE=OLAP
DUPLICATE KEY(`id`)
PARTITION BY date_trunc('day', pt)
DISTRIBUTED BY HASH(`pt`)
PROPERTIES (
"replication_num" = "1"
);
CREATE TABLE `t2` (
`id` int(11) NULL COMMENT "id",
`pt` datetime NOT NULL COMMENT "",
`gmv` int(11) NULL COMMENT "gmv"
) ENGINE=OLAP
DUPLICATE KEY(`id`)
PARTITION BY date_trunc('day', pt)
DISTRIBUTED BY HASH(`pt`)
PROPERTIES (
"replication_num" = "1"
);
create view test_view1 as
select t1.id, t1.pt, t1.gmv as gmv1, t2.gmv as gmv2
from t1, t2 where date_trunc('day', t1.pt) = date_trunc('day', t2.pt);
insert into t1 values(1, '2025-12-02 10:00:01', 1);
insert into t2 values(1, '2025-12-02 10:00:01', 1);Problematic Query (Current Behavior - Default Inlining):
WITH oversampled AS (
SELECT *
FROM test_view1
WHERE RAND() < 0.0010516300994269146
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;Issue: The RAND() predicate is pushed down and evaluated against the underlying tables (t1, t2), potentially leading to unexpected or zero results, violating the goal of sampling the result of test_view1.
Proposed Correct Query (Using MATERIALIZED):
WITH test_view1_cte AS MATERIALIZED (
SELECT *
FROM test_view1), -- This subquery result is computed and stored once.
oversampled AS (
SELECT *
FROM test_view1_cte
WHERE RAND() < 0.0010516300994269146 -- Sampling is now reliably applied to the fixed, stored result set.
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;
WITH test_view1_cte AS MATERIALIZED (
SELECT *
FROM test_view1), -- This subquery result is computed and stored once.
oversampled AS (
SELECT *
FROM test_view1_cte
WHERE RAND() < 0.0010516300994269146 -- Sampling is now reliably applied to the fixed, stored result set.
union all
SELECT *
FROM test_view1_cte
WHERE RAND() < 0.0010516300994269146 -- The second sampling operation also uses the same fixed result set.
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;