Skip to content

[Feature] Support for MATERIALIZED / NOT MATERIALIZED in Common Table Expressions (CTEs) #66230

@LiShuMing

Description

@LiShuMing

Motivation and Problem Description

In StarRocks, the query optimizer determines whether a Common Table Expression (CTE) should be inlined (not materialized) or materialized based on optimizer costs, and involving the number of times the CTE is referenced. Inlining is often the default behavior to enable aggressive optimizations like Predicate Pushdown if CTE is only used once.

However, this automatic decision process can lead to two main issues where user intervention is necessary:

  1. Semantic Errors with Non-Deterministic Functions: When a CTE or the query referencing it contains non-deterministic functions (e.g., RAND(), NOW()), inlining and subsequent query rewrite (like Predicate Pushdown) can cause the function to be re-evaluated or pushed down inappropriately. If the CTE is referenced multiple times, or if the logic requires sampling a fixed intermediate result, the re-evaluation of the non-deterministic function violates the query's intended semantics and leads to incorrect results (e.g., inconsistent sampling, or zero rows returned).
  2. Suboptimal Performance: In complex scenarios, the optimizer's heuristic might incorrectly decide to inline a large CTE that is referenced multiple times. Materializing the CTE once could save significant re-computation time, even if the CTE doesn't contain non-deterministic functions. Conversely, the optimizer might incorrectly choose to materialize a small, simple CTE, blocking beneficial pushdowns.

To provide users with control over the execution plan for correctness and performance, StarRocks should support the standard SQL syntax extensions MATERIALIZED and NOT MATERIALIZED within the WITH clause.

RAND() will be pushed down into table scans:

WITH oversampled AS (
    SELECT    *
    FROM test_view1
    WHERE RAND() < 0.0010516300994269146
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;

Existing Implementations

This syntax is common practice in the industry:

Proposed Solution

Implement the standard SQL syntax extensions to explicitly instruct the optimizer on CTE behavior:

Keyword Action Primary Use Case
MATERIALIZED Forces the CTE result to be computed once and stored temporarily (materialized). Subsequent references read from the stored result. Guaranteeing semantic correctness for queries involving non-deterministic functions (like RAND()), or preventing expensive re-computation of complex subqueries referenced multiple times.
NOT MATERIALIZED Forces the CTE to be inlined into the main query, regardless of reference count or optimizer heuristics. Overriding the optimizer when it incorrectly decides to materialize a CTE that would benefit from predicate pushdown and other cross-query optimizations.

Example Case Demonstrating the Problem and Solution

The following example shows how the lack of a MATERIALIZED keyword leads to incorrect random sampling results, where the RAND() predicate is pushed too deep due to inlining.

Initial Setup (Tables and View):

CREATE TABLE `t1` (
  `id` int(11) NULL COMMENT "id",
  `pt` datetime NOT NULL COMMENT "",
  `gmv` int(11) NULL COMMENT "gmv"
) ENGINE=OLAP
DUPLICATE KEY(`id`)
PARTITION BY date_trunc('day', pt)
DISTRIBUTED BY HASH(`pt`)
PROPERTIES (
"replication_num" = "1"
);

CREATE TABLE `t2` (
  `id` int(11) NULL COMMENT "id",
  `pt` datetime NOT NULL COMMENT "",
  `gmv` int(11) NULL COMMENT "gmv"
) ENGINE=OLAP
DUPLICATE KEY(`id`)
PARTITION BY date_trunc('day', pt)
DISTRIBUTED BY HASH(`pt`)
PROPERTIES (
"replication_num" = "1"
);

create view test_view1 as
select t1.id, t1.pt, t1.gmv as gmv1, t2.gmv as gmv2
from t1, t2 where date_trunc('day', t1.pt) = date_trunc('day', t2.pt);

insert into t1 values(1, '2025-12-02 10:00:01', 1);
insert into t2 values(1, '2025-12-02 10:00:01', 1);

Problematic Query (Current Behavior - Default Inlining):

WITH oversampled AS (
    SELECT     *
    FROM test_view1
    WHERE RAND() < 0.0010516300994269146
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;

Issue: The RAND() predicate is pushed down and evaluated against the underlying tables (t1, t2), potentially leading to unexpected or zero results, violating the goal of sampling the result of test_view1.

Proposed Correct Query (Using MATERIALIZED):

WITH test_view1_cte AS MATERIALIZED (
SELECT  *
FROM test_view1), -- This subquery result is computed and stored once.
oversampled AS (
    SELECT *
    FROM test_view1_cte
    WHERE RAND() < 0.0010516300994269146 -- Sampling is now reliably applied to the fixed, stored result set.
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;


WITH test_view1_cte AS MATERIALIZED (
SELECT  *
FROM test_view1), -- This subquery result is computed and stored once.
oversampled AS (
    SELECT *
    FROM test_view1_cte
    WHERE RAND() < 0.0010516300994269146 -- Sampling is now reliably applied to the fixed, stored result set.
    union all
      SELECT *
    FROM test_view1_cte
    WHERE RAND() < 0.0010516300994269146 -- The second sampling operation also uses the same fixed result set.
)
SELECT *
FROM oversampled
ORDER BY RAND()
LIMIT 50000;

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions