Description
We are experiencing consistent segfault crashes of StarRocks Compute Nodes under high load on ARM64. We are running StarRocks 3.5.8 in shared data cluster mode.
There seem to be multiple failure modes; here are some stats from roughly a week of running the nodes under highly concurrent load.
Platform: ARM64 CentOS (aarch64)
Version: StarRocks 3.5.8 (build 88ffb18)
Failure Mode 1: NodeChannel::is_full()
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1764152401 (unix time) try "date -d @1764152401" if you are using GNU date ***
PC: @ 0x457a4b4 starrocks::NodeChannel::is_full()
*** SIGSEGV (@0x6b736902e8) received by PID 719458 (TID 0xfffe9773fe40) LWP(720414) from PID 1936261864; stack trace: ***
@ 0xffff7e42ebc0 __pthread_once_slow
@ 0xa710bf0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0xffff7f4d6df0 PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0]
@ 0xffff7f4d77bc JVM_handle_linux_signal
@ 0xffff7fc5e830 ([vdso]+0x82f)
@ 0x457a4b4 starrocks::NodeChannel::is_full()
@ 0x4587278 std::_Function_handler<void (starrocks::NodeChannel*), starrocks::TabletSinkSender::is_full()::{lambda(starrocks::NodeChannel*)#1}>::_M_invoke(std::_Any_data const&, starrocks::NodeChannel*&&)
@ 0x4587cdc starrocks::TabletSinkSender::is_full()
@ 0x4563e48 starrocks::OlapTableSink::is_full()
@ 0x498a324 starrocks::pipeline::OlapTableSinkOperator::pending_finish() const
Failure Mode 2: NodeChannel::is_open_done()
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1764805833 (unix time) try "date -d @1764805833" if you are using GNU date ***
PC: @ 0x457a464 starrocks::NodeChannel::is_open_done()
*** SIGSEGV (@0x0) received by PID 879445 (TID 0xfffea3adfe40) LWP(880401) from PID 0; stack trace: ***
@ 0xffff8ac6fbc0 __pthread_once_slow
@ 0xa710bf0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0xffff8bcd6df0 PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0]
@ 0xffff8bcd77bc JVM_handle_linux_signal
@ 0xffff8c489830 ([vdso]+0x82f)
@ 0x457a464 starrocks::NodeChannel::is_open_done()
@ 0x4587240 std::_Function_handler<void (starrocks::NodeChannel*), starrocks::TabletSinkSender::is_open_done()::{lambda(starrocks::NodeChannel*)#1}>::_M_invoke(std::_Any_data const&, starrocks::NodeChannel*&&)
@ 0x4587e88 starrocks::TabletSinkSender::is_open_done()
@ 0x4989784 starrocks::pipeline::OlapTableSinkOperator::need_input() const
@ 0x4e958e0 starrocks::pipeline::PipelineDriver::is_not_blocked()
Failure Mode 3: PipelineDriver::is_not_blocked()
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1764819033 (unix time) try "date -d @1764819033" if you are using GNU date ***
PC: @ 0x4e956bc starrocks::pipeline::PipelineDriver::is_not_blocked()
*** SIGSEGV (@0x0) received by PID 1233889 (TID 0xfffea461fe40) LWP(1234845) from PID 0; stack trace: ***
@ 0xffff8b1edbc0 __pthread_once_slow
@ 0xa710bf0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0xffff8c2d6df0 PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0]
@ 0xffff8c2d77bc JVM_handle_linux_signal
@ 0xffff8c9f6830 ([vdso]+0x82f)
@ 0x4e956bc starrocks::pipeline::PipelineDriver::is_not_blocked()
@ 0x4e94530 starrocks::pipeline::PipelineDriverPoller::run_internal()
@ 0x3d39844 starrocks::Thread::supervise_thread(void*)
@ 0xffff8b1e8b78 start_thread
@ 0xffff8b255cdc thread_start
[1764819033.766][thread: 281469144661568] je_mallctl execute purge success
F20251204 03:30:33.978367 281463526456896 socket_inl.h:123] Over dereferenced SocketId=0
[1764819033.981][thread: 281463526456896] je_mallctl execute purge success
[1764819033.766][thread: 281469144661568] je_mallctl execute dontdump success
start time: Thu Dec 4 03:35:43 UTC 2025, server uptime: 03:35:43 up 20:55, 0 users, load average: 830.40, 606.98, 273.09
Run with JEMALLOC_CONF: 'percpu_arena:percpu,oversize_threshold:0,muzzy_decay_ms:5000,dirty_decay_ms:5000,metadata_thp:auto,background_thread:true,prof:true,prof_active:false'
3.5.8 RELEASE (build 88ffb18 distro centos arch aarch64)
Failure Mode 4: BinaryColumnBase::_build_slices()
*** Aborted at 1764819033 (unix time) try "date -d @1764819033" if you are using GNU date ***
PC: @ 0x4500b50 starrocks::BinaryColumnBase<unsigned int>::_build_slices() const
*** SIGSEGV (@0xfffb87d80000) received by PID 1734655 (TID 0xfffe8f45fe40) LWP(1735625) from PID 18446744071693664256; stack trace: ***
@ 0xffff7e42ebc0 __pthread_once_slow
@ 0xa710bf0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0xffff7f4d6df0 PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0]
@ 0xffff7f4d77bc JVM_handle_linux_signal
@ 0xffff7fc21830 ([vdso]+0x82f)
@ 0x4500b50 starrocks::BinaryColumnBase<unsigned int>::_build_slices() const
@ 0x44a2d40 starrocks::ColumnViewer<(starrocks::LogicalType)17>::ColumnViewer(starrocks::Cow<starrocks::Column>::ImmutPtr<starrocks::Column> const&)
@ 0x6630ab8 starrocks::CastStringToArray::evaluate_checked(starrocks::ExprContext*, starrocks::Chunk*)
@ 0x5b65578 starrocks::ExprContext::evaluate(starrocks::Expr*, starrocks::Chunk*, unsigned char*)
@ 0x683753c starrocks::ArrayMapExpr::evaluate_checked(starrocks::ExprContext*, starrocks::Chunk*)
Failure Mode 5: ScanOperator::set_finishing()
query_id:00000000-0000-0000-0000-000000000000, fragment_instance:00000000-0000-0000-0000-000000000000
*** Aborted at 1764805833 (unix time) try "date -d @1764805833" if you are using GNU date ***
PC: @ 0x499ab90 starrocks::pipeline::ScanOperator::set_finishing(starrocks::RuntimeState*)
*** SIGSEGV (@0x7469726f6874ed) received by PID 1250196 (TID 0xfffeccc5fe40) LWP(1251147) from PID 1869116653; stack trace: ***
@ 0xffffb374ebc0 __pthread_once_slow
@ 0xa710bf0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0xffffb46d6df0 PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0]
@ 0xffffb46d77bc JVM_handle_linux_signal
@ 0xffffb4f71830 ([vdso]+0x82f)
@ 0x499ab90 starrocks::pipeline::ScanOperator::set_finishing(starrocks::RuntimeState*)
@ 0x4a28744 starrocks::pipeline::PipelineDriver::_mark_operator_finishing(std::shared_ptr<starrocks::pipeline::Operator>&, starrocks::RuntimeState*)
@ 0x4a289fc starrocks::pipeline::PipelineDriver::_mark_operator_finished(std::shared_ptr<starrocks::pipeline::Operator>&, starrocks::RuntimeState*)
@ 0x4a29178 starrocks::pipeline::PipelineDriver::_mark_operator_cancelled(std::shared_ptr<starrocks::pipeline::Operator>&, starrocks::RuntimeState*)
@ 0x4a29694 starrocks::pipeline::PipelineDriver::cancel_operators(starrocks::RuntimeState*)
AI: Summary of Findings
The following is the LLM's analysis, including a fully AI-generated patch. I can't judge its quality, so take it or leave it.
🤖 🤖 🤖
Root Causes Identified
1. Use-After-Free in NodeChannel (51% of crashes)
   - Location: `tablet_sink_index_channel.cpp:433` (`is_full()`)
   - Cause: Operator accesses `_add_batch_closures[i]` on a deleted NodeChannel
   - Fix: Shared pointer ownership (see the patch below)
2. ARM64 Memory Ordering Issues
   - Location: `ref_count_closure.h:34` (`count()`)
   - Cause: Missing memory barriers in atomic operations
   - Fix: Use `std::memory_order_acquire`/`release` (see the ref-count sketch after this list)
3. Iterator Invalidation in TabletSinkSender
   - Location: `tablet_sink_sender.h:58` (`for_each_node_channel()`)
   - Cause: Concurrent modification during iteration
   - Fix: Add `shared_mutex` protection (see the locking sketch after this list)
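The memory-ordering fix for root cause 2 is not part of the patch below, so here is a minimal hedged sketch of the pattern the analysis suggests. The class name and layout are illustrative and do not match the real `ref_count_closure.h`; the point is only the release decrement paired with an acquire fence before the last owner frees the object.

```cpp
// Illustrative only -- not the actual starrocks::RefCountClosure. On ARM64's weaker
// memory model, a relaxed decrement can let the freeing thread miss writes made by the
// last user of the closure; release/acquire ordering closes that window.
#include <atomic>

class RefCountClosureSketch {
public:
    void ref() { _count.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the caller dropped the last reference and may delete the closure.
    bool unref() {
        if (_count.fetch_sub(1, std::memory_order_release) == 1) {
            // Make every write done by other threads before their unref() visible
            // to the thread that is about to free the object.
            std::atomic_thread_fence(std::memory_order_acquire);
            return true;
        }
        return false;
    }

    int count() const { return _count.load(std::memory_order_acquire); }

private:
    std::atomic<int> _count{1};
};
```

Whether the real closure can adopt exactly this shape depends on how brpc reuses it, so treat it as a starting point rather than a drop-in change.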
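For root cause 3, here is a hedged sketch of what `shared_mutex` protection around `for_each_node_channel()` could look like. The member names mimic the report, but the interface is invented for illustration and is not the real `tablet_sink_sender.h`.

```cpp
// Readers (the is_full()/is_open_done() paths) take a shared lock; anything that inserts
// or erases channels takes an exclusive lock, so the map cannot change under an iterator.
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

class NodeChannel;  // opaque here

class TabletSinkSenderSketch {
public:
    void for_each_node_channel(const std::function<void(NodeChannel*)>& fn) const {
        std::shared_lock<std::shared_mutex> guard(_channel_mutex);  // many concurrent readers
        for (const auto& entry : _node_channels) {
            fn(entry.second.get());
        }
    }

    void add_node_channel(int64_t node_id, std::unique_ptr<NodeChannel> channel) {
        std::unique_lock<std::shared_mutex> guard(_channel_mutex);  // exclusive writer
        _node_channels.emplace(node_id, std::move(channel));
    }

private:
    mutable std::shared_mutex _channel_mutex;
    std::unordered_map<int64_t, std::unique_ptr<NodeChannel>> _node_channels;
};
```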
Crash Statistics
| Function | Crashes | % |
|---|---|---|
| NodeChannel::is_full() | 24 | 51% |
| PipelineDriver::is_not_blocked() | 4 | 8.5% |
| NodeChannel::is_open_done() | 3 | 6.4% |
| BinaryColumnBase::_build_slices() | 3 | 6.4% |
| ScanOperator::set_finishing() | 2 | 4.3% |
| Other | 11 | 23.4% |
| Total | 47 | 100% |
Memory Access Patterns
- Null pointers (0x0): 6 crashes
- Small offsets (0x2e8, 0x310): 15+ crashes (deleted object access)
- Corrupted addresses: 26+ crashes (heap corruption)
Patch
diff --git a/be/src/exec/data_sink.cpp b/be/src/exec/data_sink.cpp
index 0431509f..6b070de3 100644
--- a/be/src/exec/data_sink.cpp
+++ b/be/src/exec/data_sink.cpp
@@ -272,7 +272,8 @@ DIAGNOSTIC_IGNORE("-Wpotentially-evaluated-expression")
Status DataSink::decompose_data_sink_to_pipeline(pipeline::PipelineBuilderContext* context, RuntimeState* runtime_state,
pipeline::OpFactories prev_operators,
const pipeline::UnifiedExecPlanFragmentParams& request,
- const TDataSink& thrift_sink, const std::vector<TExpr>& output_exprs) {
+ const TDataSink& thrift_sink, const std::vector<TExpr>& output_exprs,
+ std::shared_ptr<DataSink> parent_sink) {
using namespace pipeline;
auto fragment_ctx = context->fragment_context();
size_t dop = context->source_operator(prev_operators)->degree_of_parallelism();
@@ -415,15 +416,15 @@ Status DataSink::decompose_data_sink_to_pipeline(pipeline::PipelineBuilderContex
size_t desired_tablet_sink_dop = request.pipeline_sink_dop();
DCHECK(desired_tablet_sink_dop > 0);
runtime_state->set_num_per_fragment_instances(request.common().params.num_senders);
- std::vector<std::unique_ptr<AsyncDataSink>> tablet_sinks;
+ std::vector<std::shared_ptr<AsyncDataSink>> tablet_sinks;
for (int i = 1; i < desired_tablet_sink_dop; i++) {
Status st;
- std::unique_ptr<AsyncDataSink> sink;
+ std::shared_ptr<AsyncDataSink> sink;
if (typeid(*this) == typeid(OlapTableSink)) {
- sink = std::make_unique<OlapTableSink>(runtime_state->obj_pool(), output_exprs, &st, runtime_state);
+ sink = std::make_shared<OlapTableSink>(runtime_state->obj_pool(), output_exprs, &st, runtime_state);
RETURN_IF_ERROR(st);
} else {
- sink = std::make_unique<MultiOlapTableSink>(runtime_state->obj_pool(), output_exprs);
+ sink = std::make_shared<MultiOlapTableSink>(runtime_state->obj_pool(), output_exprs);
}
if (sink != nullptr) {
RETURN_IF_ERROR(sink->init(thrift_sink, runtime_state));
@@ -431,7 +432,7 @@ Status DataSink::decompose_data_sink_to_pipeline(pipeline::PipelineBuilderContex
tablet_sinks.emplace_back(std::move(sink));
}
OpFactoryPtr tablet_sink_op = std::make_shared<OlapTableSinkOperatorFactory>(
- context->next_operator_id(), this, fragment_ctx, request.sender_id(), desired_tablet_sink_dop,
+ context->next_operator_id(), parent_sink, fragment_ctx, request.sender_id(), desired_tablet_sink_dop,
tablet_sinks);
// FE will pre-set the parallelism for all fragment instance which contains the tablet sink,
// For stream load, routine load or broker load, the desired_tablet_sink_dop set
diff --git a/be/src/exec/data_sink.h b/be/src/exec/data_sink.h
index a704837d..fb9605f4 100644
--- a/be/src/exec/data_sink.h
+++ b/be/src/exec/data_sink.h
@@ -100,7 +100,8 @@ public:
Status decompose_data_sink_to_pipeline(pipeline::PipelineBuilderContext* context, RuntimeState* state,
pipeline::OpFactories prev_operators,
const pipeline::UnifiedExecPlanFragmentParams& request,
- const TDataSink& thrift_sink, const std::vector<TExpr>& output_exprs);
+ const TDataSink& thrift_sink, const std::vector<TExpr>& output_exprs,
+ std::shared_ptr<DataSink> parent_sink = nullptr);
private:
OperatorFactoryPtr _create_exchange_sink_operator(pipeline::PipelineBuilderContext* context,
diff --git a/be/src/exec/pipeline/fragment_context.cpp b/be/src/exec/pipeline/fragment_context.cpp
index 0f5f69d1..41063f30 100644
--- a/be/src/exec/pipeline/fragment_context.cpp
+++ b/be/src/exec/pipeline/fragment_context.cpp
@@ -67,7 +67,7 @@ void FragmentContext::close_all_execution_groups() {
void FragmentContext::move_tplan(TPlan& tplan) {
swap(_tplan, tplan);
}
-void FragmentContext::set_data_sink(std::unique_ptr<DataSink> data_sink) {
+void FragmentContext::set_data_sink(std::shared_ptr<DataSink> data_sink) {
_data_sink = std::move(data_sink);
}
diff --git a/be/src/exec/pipeline/fragment_context.h b/be/src/exec/pipeline/fragment_context.h
index 3b337570..5c3f38a2 100644
--- a/be/src/exec/pipeline/fragment_context.h
+++ b/be/src/exec/pipeline/fragment_context.h
@@ -73,7 +73,7 @@ public:
void move_tplan(TPlan& tplan);
const TPlan& tplan() const { return _tplan; }
- void set_data_sink(std::unique_ptr<DataSink> data_sink);
+ void set_data_sink(std::shared_ptr<DataSink> data_sink);
size_t total_dop() const;
@@ -201,7 +201,7 @@ private:
// Hold tplan data datasink from delivery request to create driver lazily
// after delivery request has been finished.
TPlan _tplan;
- std::unique_ptr<DataSink> _data_sink;
+ std::shared_ptr<DataSink> _data_sink;
// promise used to determine whether fragment finished its execution
FragmentPromise _finish_promise;
diff --git a/be/src/exec/pipeline/fragment_executor.cpp b/be/src/exec/pipeline/fragment_executor.cpp
index e3a630e8..6a13ec77 100644
--- a/be/src/exec/pipeline/fragment_executor.cpp
+++ b/be/src/exec/pipeline/fragment_executor.cpp
@@ -737,7 +737,7 @@ Status FragmentExecutor::_prepare_pipeline_driver(ExecEnv* exec_env, const Unifi
PipelineBuilder builder(context);
auto exec_ops = builder.decompose_exec_node_to_pipeline(*_fragment_ctx, plan);
// Set up sink if required
- std::unique_ptr<DataSink> datasink;
+ std::shared_ptr<DataSink> datasink;
if (request.isset_output_sink()) {
const auto& tsink = request.output_sink();
if (tsink.type == TDataSinkType::RESULT_SINK || tsink.type == TDataSinkType::OLAP_TABLE_SINK ||
@@ -747,10 +747,12 @@ Status FragmentExecutor::_prepare_pipeline_driver(ExecEnv* exec_env, const Unifi
tsink.type == TDataSinkType::DICTIONARY_CACHE_SINK) {
_query_ctx->set_final_sink();
}
+ std::unique_ptr<DataSink> unique_sink;
RETURN_IF_ERROR(DataSink::create_data_sink(runtime_state, tsink, fragment.output_exprs, params,
- request.sender_id(), plan->row_desc(), &datasink));
+ request.sender_id(), plan->row_desc(), &unique_sink));
+ datasink = std::move(unique_sink);
RETURN_IF_ERROR(datasink->decompose_data_sink_to_pipeline(&context, runtime_state, std::move(exec_ops), request,
- tsink, fragment.output_exprs));
+ tsink, fragment.output_exprs, datasink));
}
_fragment_ctx->set_data_sink(std::move(datasink));
auto [exec_groups, pipelines] = builder.build();
diff --git a/be/src/exec/pipeline/olap_table_sink_operator.cpp b/be/src/exec/pipeline/olap_table_sink_operator.cpp
index a48a7a4b..33e58483 100644
--- a/be/src/exec/pipeline/olap_table_sink_operator.cpp
+++ b/be/src/exec/pipeline/olap_table_sink_operator.cpp
@@ -170,7 +170,7 @@ OperatorPtr OlapTableSinkOperatorFactory::create(int32_t degree_of_parallelism,
_sink0, _fragment_ctx, _num_sinkers);
} else {
return std::make_shared<OlapTableSinkOperator>(this, _id, _plan_node_id, driver_sequence, _cur_sender_id++,
- _sinks[driver_sequence - 1].get(), _fragment_ctx, _num_sinkers);
+ _sinks[driver_sequence - 1], _fragment_ctx, _num_sinkers);
}
}
diff --git a/be/src/exec/pipeline/olap_table_sink_operator.h b/be/src/exec/pipeline/olap_table_sink_operator.h
index d2e542e2..70335aa2 100644
--- a/be/src/exec/pipeline/olap_table_sink_operator.h
+++ b/be/src/exec/pipeline/olap_table_sink_operator.h
@@ -14,6 +14,7 @@
#pragma once
+#include <memory>
#include <utility>
#include "exec/pipeline/fragment_context.h"
@@ -28,10 +29,10 @@ namespace pipeline {
class OlapTableSinkOperator final : public Operator {
public:
OlapTableSinkOperator(OperatorFactory* factory, int32_t id, int32_t plan_node_id, int32_t driver_sequence,
- int32_t sender_id, starrocks::AsyncDataSink* sink, FragmentContext* const fragment_ctx,
- std::atomic<int32_t>& num_sinkers)
+ int32_t sender_id, std::shared_ptr<starrocks::AsyncDataSink> sink,
+ FragmentContext* const fragment_ctx, std::atomic<int32_t>& num_sinkers)
: Operator(factory, id, "olap_table_sink", plan_node_id, false, driver_sequence),
- _sink(sink),
+ _sink(std::move(sink)),
_fragment_ctx(fragment_ctx),
_num_sinkers(num_sinkers),
_sender_id(sender_id) {}
@@ -59,7 +60,7 @@ public:
Status push_chunk(RuntimeState* state, const ChunkPtr& chunk) override;
private:
- starrocks::AsyncDataSink* _sink;
+ std::shared_ptr<starrocks::AsyncDataSink> _sink;
FragmentContext* const _fragment_ctx;
std::atomic<int32_t>& _num_sinkers;
@@ -74,11 +75,11 @@ private:
class OlapTableSinkOperatorFactory final : public OperatorFactory {
public:
- OlapTableSinkOperatorFactory(int32_t id, starrocks::DataSink* sink, FragmentContext* const fragment_ctx,
- int32_t start_sender_id, size_t tablet_sink_dop,
- std::vector<std::unique_ptr<starrocks::AsyncDataSink>>& tablet_sinks)
+ OlapTableSinkOperatorFactory(int32_t id, std::shared_ptr<starrocks::DataSink> sink,
+ FragmentContext* const fragment_ctx, int32_t start_sender_id, size_t tablet_sink_dop,
+ std::vector<std::shared_ptr<starrocks::AsyncDataSink>>& tablet_sinks)
: OperatorFactory(id, "olap_table_sink", Operator::s_pseudo_plan_node_id_for_final_sink),
- _sink0(down_cast<starrocks::AsyncDataSink*>(sink)),
+ _sink0(std::static_pointer_cast<starrocks::AsyncDataSink>(sink)),
_fragment_ctx(fragment_ctx),
_cur_sender_id(start_sender_id),
_sinks(std::move(tablet_sinks)) {}
@@ -94,11 +95,11 @@ public:
private:
void _increment_num_sinkers_no_barrier() { _num_sinkers.fetch_add(1, std::memory_order_relaxed); }
- starrocks::AsyncDataSink* _sink0;
+ std::shared_ptr<starrocks::AsyncDataSink> _sink0;
FragmentContext* const _fragment_ctx;
std::atomic<int32_t> _num_sinkers = 0;
int32_t _cur_sender_id;
- std::vector<std::unique_ptr<starrocks::AsyncDataSink>> _sinks;
+ std::vector<std::shared_ptr<starrocks::AsyncDataSink>> _sinks;
};
} // namespace pipeline
Detailed Verification
1. The Root Cause
The crash is a Use-After-Free race condition.
- Mechanism: `FragmentContext` owns the DataSink via `unique_ptr`. When a query is cancelled, `FragmentContext` destroys the sink. However, parallel `PipelineDriver` threads may still be executing `sink->close()` or `sink->is_full()`.
- The patch ensures that `FragmentContext` and `OlapTableSinkOperator` share ownership of the primary sink, so the sink cannot be destroyed until both are done with it (sketched below).
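A minimal sketch of that ownership change, using simplified stand-in types rather than the real classes: once both sides hold a `shared_ptr`, clearing the context's pointer on cancellation no longer frees the sink out from under a driver thread.

```cpp
// Stand-in types for illustration; the real patch changes FragmentContext::_data_sink
// and OlapTableSinkOperator::_sink to shared_ptr, which this mirrors in miniature.
#include <memory>

struct AsyncDataSinkSketch {
    bool is_full() const { return false; }
};

struct FragmentContextSketch {
    std::shared_ptr<AsyncDataSinkSketch> data_sink;  // was unique_ptr before the patch
};

struct OlapTableSinkOperatorSketch {
    std::shared_ptr<AsyncDataSinkSketch> sink;       // was a raw AsyncDataSink*
    bool pending_finish() const { return sink->is_full(); }
};

int main() {
    FragmentContextSketch ctx{std::make_shared<AsyncDataSinkSketch>()};
    OlapTableSinkOperatorSketch op{ctx.data_sink};  // second owner of the sink

    ctx.data_sink.reset();        // cancellation path: the context lets go of the sink
    return op.pending_finish();   // still safe: the operator keeps the sink alive
}
```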
2. Thread Safety & Performance
The analysis claims the performance impact is negligible. I agree:
- Overhead: We are trading a raw pointer dereference for `shared_ptr` access. The atomic reference counting only happens during operator creation and destruction (the setup phase), not during the hot path (pushing chunks).
- Hot Path: Inside `push_chunk`, the operator accesses the sink through its `shared_ptr` member, which is effectively as fast as a raw pointer once the object is loaded (see the sketch below).
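A tiny illustration of that cost argument, with hypothetical types: the refcount is only touched where the `shared_ptr` is copied (once per operator, at creation), while the per-chunk call is a plain pointer load.

```cpp
// Hypothetical types; only meant to show where the atomic refcount traffic happens.
#include <memory>

struct SinkSketch {
    void send_chunk() {}
};

struct OperatorSketch {
    std::shared_ptr<SinkSketch> _sink;          // copying this in at setup bumps the refcount once
    void push_chunk() { _sink->send_chunk(); }  // hot path: pointer load + call, no atomics
};
```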
3. Edge Cases
- Circular References: I verified the object graph: `FragmentContext` -> `Pipeline` -> `Operator` -> `DataSink`. There is no back-pointer from `DataSink` to `FragmentContext`, so no cycles are introduced.
- Null Parent Sink: The analysis suggests a default `nullptr` for `parent_sink`. I recommend removing the default argument to force all call sites to be updated, ensuring we don't accidentally leave a code path unprotected.