While implementing CPU Grouped-Query Attention, I encountered a limitation with broadcast_matmul and stride-0 dimensions. I'm taking a different approach (fused SIMD kernels) for my use case, but wanted to log this for others who may benefit from the fix.
Currently `broadcast_as` creates a stride-0 view, and `matmul` rejects stride-0 dimensions as "non-contiguous". So I needed the workaround `.broadcast_as(...).contiguous()`, which physically expands the memory (see the repro sketch after the code block below).
```rust
// GQA: Q has 16 heads, K/V have 8 heads (2 groups)
let q = ...; // [1, 8, 2, 2, 128]
let k = ...; // [1, 8, 1, 128, 2] ← size-1 dim should broadcast
let scores = q.broadcast_matmul(&k)?; // Error: "non-contiguous lhs"
```

I expected dim 2 of the rhs to broadcast (1 → 2) during the matmul, but instead I got the error above once the internal broadcast created a stride-0 view.
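For anyone hitting the same thing, here is a minimal, self-contained sketch of the failure and the `.contiguous()` workaround. It assumes candle-core's public API (`Tensor::randn`, `broadcast_as`, `contiguous`, `matmul`, `broadcast_matmul`) and the shapes from the example above; the exact error text may vary.

```rust
use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // GQA shapes: 8 KV heads, 2 query groups per head.
    let q = Tensor::randn(0f32, 1f32, (1, 8, 2, 2, 128), &dev)?;
    let k = Tensor::randn(0f32, 1f32, (1, 8, 1, 128, 2), &dev)?;

    // Fails today: the internal broadcast_as produces a stride-0 view
    // that the CPU matmul kernel rejects as non-contiguous.
    // let scores = q.broadcast_matmul(&k)?;

    // Workaround: materialize the broadcast first. This works but
    // physically copies the K data across the group dimension.
    let k_expanded = k.broadcast_as((1, 8, 2, 128, 2))?.contiguous()?;
    let scores = q.matmul(&k_expanded)?;
    assert_eq!(scores.dims(), &[1, 8, 2, 2, 2]);
    Ok(())
}
```

The copy is the whole cost here: for GQA the broadcast dimension is the number of query groups, so the workaround duplicates every K/V tensor per group instead of letting the kernel read the same memory with stride 0.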