fn create_lf_mask_inter: pull level_cache slice out of inner loop
#1401
Conversation
Pulls the `level_cache.index_mut` slice out of the inner (`x`) loop, since accesses are contiguous over `0..bw4`. This reduces overhead and allows the compiler to unroll the inner loop 4 times. Also annotates the slice used to access the level cache as a `&mut [[u8; 4]]`. This should aid unrolling and allow the compiler to speculatively load both elements, avoiding the unnecessary bounds check each time.
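The shape of the change is roughly the following self-contained sketch. Plain slices and the helper names `fill_row_before`/`fill_row_after` are stand-ins for illustration; the real code goes through `level_cache.index_mut` and the `filter_level` array, so the exact indexing differs.

```rust
// Before (sketch): re-slice the level cache on every iteration, so each
// iteration pays its own bounds check and LLVM struggles to unroll.
fn fill_row_before(level_cache: &mut [u8], idx: usize, bw4: usize, fl: [u8; 2]) {
    for x in 0..bw4 {
        let lvl = &mut level_cache[idx + x * 4..][..2];
        lvl[0] = fl[0];
        lvl[1] = fl[1];
    }
}

// After (sketch): take one slice for the whole row, view it as [u8; 4] entries,
// and let the inner loop become plain array writes that LLVM can unroll.
fn fill_row_after(level_cache: &mut [u8], idx: usize, bw4: usize, fl: [u8; 2]) {
    let lvl = &mut level_cache[idx..][..bw4 * 4];
    let lvl: &mut [[u8; 4]] = bytemuck::cast_slice_mut(lvl);
    for x in 0..bw4 {
        lvl[x][0] = fl[0];
        lvl[x][1] = fl[1];
    }
}
```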
```rust
use crate::src::levels::TxfmSize;
use crate::src::relaxed_atomic::RelaxedAtomic;
use crate::src::tables::dav1d_txfm_dimensions;
use bytemuck;
```
We're already using zerocopy, so if you can do these operations with zerocopy instead of bytemuck that would be much preferred.
Let me see if zerocopy can do the same thing. If not, it's also a "relatively safe" unsafe `mem::transmute` call.
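Not part of this PR, but for comparison, here's a sketch of the same view taken with zerocopy instead of bytemuck, assuming the zerocopy release in use exposes `FromBytes::mut_slice_from` (0.7 does; newer releases rename these methods):

```rust
use zerocopy::FromBytes;

// View a row of bytes as 4-byte entries, like bytemuck::cast_slice_mut does in
// this PR. Returns None if the length is not a multiple of 4.
fn as_rows(row: &mut [u8]) -> Option<&mut [[u8; 4]]> {
    <[u8; 4]>::mut_slice_from(row)
}
```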
kkysen left a comment
I think what's happening here is that this is faster, but it is creating overlapping mutable slices (see the debug panics in CI). If we can figure out a way to do this without creating those overlapping slices, perhaps with some more unsafe if we have to, that should be good. I'll take a closer look, but I just wanted to give you my initial thoughts.
Thanks for finding this, though!
fn create_lf_mask_inter: pull level_cache slice out of inner loop
I'm somewhat surprised that it creates overlapping references since the accesses are contiguous; most likely it's between the two loops in the function. I'll look into this a bit more. My other question would be why exactly taking separate references causes overhead, but my guess is it's difficult for the compiler to speculatively load past them, which gets in the way of unrolling and eliding the bounds checks. If it doesn't work out we can ignore it; it's not a huge improvement, but every bit counts when you're chasing 5% :) I'll keep chipping away at things in my spare time.
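For readers following along, here's a toy model of the kind of check that's tripping: in debug builds, overlapping mutable slices of the level cache panic. The `BorrowTracker` below is a hypothetical stand-in, not `DisjointMut`'s actual implementation.

```rust
use std::ops::Range;

#[derive(Default)]
struct BorrowTracker {
    active: Vec<Range<usize>>, // byte ranges currently borrowed mutably
}

impl BorrowTracker {
    fn borrow_mut(&mut self, new: Range<usize>) {
        // Two live mutable borrows must not overlap; panic like the CI debug check.
        for old in &self.active {
            assert!(
                new.end <= old.start || old.end <= new.start,
                "overlapping mutable borrows: {:?} and {:?}",
                old,
                new
            );
        }
        self.active.push(new);
    }
}

fn main() {
    let mut t = BorrowTracker::default();
    t.borrow_mut(0..8); // the whole-row slice hoisted out of the inner loop
    t.borrow_mut(4..6); // a later slice over the same bytes -> panics
}
```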
```rust
lvl[x][0] = filter_level[0][0][0];
lvl[x][1] = filter_level[1][0][0];
```
i was also looking at this function yesterday, you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89)
the factor by which the loop is unrolled is partially a question of the size of the loop, so where this produces pairs of 5-byte stores for each iteration:
```asm
f10e9: e9 72 ff ff ff jmp f1060
f10ee: 66 90 xchg %ax,%ax
f10f0: 4a 8d 04 32 lea (%rdx,%r14,1),%rax
f10f4: 45 31 ff xor %r15d,%r15d
f10f7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
00 00
f1100: 42 88 74 b8 f3 mov %sil,-0xd(%rax,%r15,4)
f1105: 42 88 5c b8 f4 mov %bl,-0xc(%rax,%r15,4)
f110a: 42 88 74 b8 f7 mov %sil,-0x9(%rax,%r15,4)
f110f: 42 88 5c b8 f8 mov %bl,-0x8(%rax,%r15,4)
f1114: 42 88 74 b8 fb mov %sil,-0x5(%rax,%r15,4)
f1119: 42 88 5c b8 fc mov %bl,-0x4(%rax,%r15,4)
f111e: 42 88 74 b8 ff mov %sil,-0x1(%rax,%r15,4)
f1123: 42 88 1c b8 mov %bl,(%rax,%r15,4)
f1127: 49 83 c7 04 add $0x4,%r15
f112b: 4d 39 fb cmp %r15,%r11
f112e: 75 d0 jne f1100
```
i'd seen slightly better throughput still with the iteration being expressed as one 4-byte store:
```asm
f10f9: 72 2e jb f1129 <_ZN5rav1d3src7lf_mask26rav1d_create_lf_mask_inter17h52008c9dbd379820E+0x1b9>
f10fb: 4d 89 cb mov %r9,%r11
f10fe: 66 90 xchg %ax,%ax
f1100: 66 89 0f mov %cx,(%rdi)
f1103: 66 89 4f 04 mov %cx,0x4(%rdi)
f1107: 66 89 4f 08 mov %cx,0x8(%rdi)
f110b: 66 89 4f 0c mov %cx,0xc(%rdi)
f110f: 66 89 4f 10 mov %cx,0x10(%rdi)
f1113: 66 89 4f 14 mov %cx,0x14(%rdi)
f1117: 66 89 4f 18 mov %cx,0x18(%rdi)
f111b: 66 89 4f 1c mov %cx,0x1c(%rdi)
f111f: 48 83 c7 20 add $0x20,%rdi
f1123: 49 83 c3 f8 add $0xfffffffffffffff8,%r11
f1127: 75 d7 jne f1100
```
that said i'd initially taken an approach of re-slicing from lvl_buf in the innermost loop instead of incrementing x - that worked well too, except it kept maintaining the slice length (an additional sub each loop iteration) where your approach here doesn't do that. it is unfortunate that you get the slightly wider `mov %sil,-0xd(%base,%index,4)`-style stores though..
i also had a hard time precisely measuring this one. on a 9950x, my change was an improvement on the order of 0.5% for the 10-bit video. on the 8-bit video, where i've got better numbers: ~1% fewer loads/stores dispatched (presumably mostly because of the loop unrolling), 0.6% fewer instructions in total, 0.25% fewer total cycles.
i would expect your change to have similar results, maybe a smaller change overall because of the slightly less aggressive loop unrolling.
Thanks for the help! I didn't notice that; let me try to convince LLVM to fuse the stores, similar to what you're doing. Your numbers sound right - they're exactly what I would estimate based on the performance improvement to this function alone. Let me look at instruction/cycle counts; they may be more useful than wall clock, as there's just too much noise to measure <1% improvements consistently.
If you'd like to share thoughts/work together on completing this fix, let me know!
On 10-bit Chimera for 4000 frames, I'm getting 1.8% fewer instructions and 0.5-1.2% lower wall clock time compared to baseline now. The former seems to be reproducible/consistent; the latter is still affected by noise. This seems better than yours, which is suspicious, so I'll take a closer look.
i think those observations make sense for the 10-bit video, and in particular on a 16-core processor. i noticed this weekend that dav1d and rav1d both struggle to keep threads busy past ~12 threads and for some inputs more threads past 18 start to harm throughput. haven't looked too much into that since the overall behavior is very consistent between the two, but it definitely seems like the overall work-scheduling situation could be.... much better
> i was also looking at this function yesterday, you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89)
I think I found a way to do this that's simpler and safer. I'll open a PR for just that part. Thanks for finding out that the 2 u8 stores aren't being coalesced! I haven't looked at the rest of the loop in detail yet, but the stores I should be able to fix.
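Here's a minimal sketch of that fix, using the same `u16::mut_from` write that appears in the follow-up change described below; the helper name and the packing step are illustrative rather than the actual rav1d code:

```rust
use zerocopy::FromBytes;

// Write both filter levels for one block with a single 16-bit store (strh on
// aarch64) instead of two 8-bit stores (strb). `lvl` must be exactly 2 bytes
// long and 2-byte aligned, or the unwrap fails.
fn write_levels(lvl: &mut [u8], levels: [u8; 2]) {
    let packed = u16::from_ne_bytes(levels); // pack the two levels native-endian
    *u16::mut_from(lvl).unwrap() = packed;
}
```

Compared to assigning `lvl[0]` and `lvl[1]` separately, this makes the 16-bit store explicit instead of relying on LLVM to merge the two byte stores.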
Previously, two adjacent 8-bit stores were done. By doing them as `u16`s instead of `[u8]`s/`[u8; 2]`s, we can get a single 16-bit store to be emitted instead. This gets `strh`s emitted instead of the pair of `strb`s previously. Based on what @iximeow found in iximeow@e133a89 and #1401 (comment).

Before:

```asm
❯ cargo asm -p rav1d --lib --rust rav1d_create_lf_mask_intra --color | rg filter_level -C 10
if start <= end && end <= len {
    cmp x13, x12
    b.hi .LBB440_79
// /home/khyber/work/rav1d/src/lf_mask.rs : 458
let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
    ldr x12, [x1, #8]
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    subs x7, x7, #1
// /home/khyber/work/rav1d/src/lf_mask.rs : 459
lvl[0] = filter_level[0][0][0];
    add x12, x12, x11
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x11, x11, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 459
lvl[0] = filter_level[0][0][0];
    strb w15, [x12]
// /home/khyber/work/rav1d/src/lf_mask.rs : 460
lvl[1] = filter_level[1][0][0];
    strb w16, [x12, #1]
// /home/khyber/work/rav1d/src/lf_mask.rs : 455
for x in 0..bw4 {
    b.ne .LBB440_4
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x6, x20
// /home/khyber/work/rav1d/src/lf_mask.rs : 454
for _y in 0..bh4 {
    add x17, x17, x18
--
// /home/khyber/work/rav1d/src/disjoint_mut.rs : 729
if start <= end && end <= len {
    add x20, x3, x19
    add x21, x20, #3
    cmp x21, x2
    b.hs .LBB440_80
// /home/khyber/work/rav1d/src/lf_mask.rs : 496
let lvl = &mut *level_cache.index_mut((idx + 2.., ..2));
    ldr x2, [x1, #8]
// /home/khyber/work/rav1d/src/lf_mask.rs : 497
lvl[0] = filter_level[2][0][0];
    add x2, x2, x3
    add x2, x2, x19
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x19, x19, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 493
for x in 0..cbw4 {
    cmp x18, x19
// /home/khyber/work/rav1d/src/lf_mask.rs : 497
lvl[0] = filter_level[2][0][0];
    strb w16, [x2, #2]
// /home/khyber/work/rav1d/src/lf_mask.rs : 498
lvl[1] = filter_level[3][0][0];
    strb w17, [x2, #3]
// /home/khyber/work/rav1d/src/lf_mask.rs : 493
for x in 0..cbw4 {
    b.ne .LBB440_72
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x6, x4
// /home/khyber/work/rav1d/src/lf_mask.rs : 492
for _y in 0..cbh4 {
    add x3, x3, x5
```

After:

```asm
❯ cargo asm -p rav1d --lib --rust rav1d_create_lf_mask_intra --color | rg filter_level -C 10
    ldr x24, [sp, #224]
    ldrb w10, [sp, #184]
// /home/khyber/work/rav1d/src/lf_mask.rs : 447
let b_dim = b_dim.map(|it| it as usize);
    ldr w8, [x9, x8, lsl #2]
// /home/khyber/work/rav1d/src/lf_mask.rs : 448
let bw4 = cmp::min(iw - bx, b_dim[0]);
    sub x9, x25, x28
    ldrb w6, [sp, #176]
// /home/khyber/work/rav1d/src/lf_mask.rs : 453
let filter_level_yuv = filter_level.0.map(|a| a[0][0]);
    ldrb w23, [x3, #32]
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/array/mod.rs : 160
}
    ldrb w14, [x3, #48]
// /home/khyber/work/rav1d/src/lf_mask.rs : 450
let bx4 = bx & 31;
    and x30, x28, #0x1f
// /home/khyber/work/rav1d/src/lf_mask.rs : 447
let b_dim = b_dim.map(|it| it as usize);
    and x26, x8, #0xff
--
let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
    ldr x12, [x1, #8]
    add w13, w11, w12
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs : 1001
match self {
    tbnz w13, #0, .LBB440_21
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    subs x3, x3, #1
// /home/khyber/work/rav1d/src/lf_mask.rs : 463
*u16::mut_from(lvl).unwrap() = filter_level_y;
    strh w15, [x12, x11]
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x11, x11, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 459
for x in 0..bw4 {
    b.ne .LBB440_4
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x18, x20
--
    tbnz w19, #0, .LBB440_82
    add x2, x2, x17
    add x2, x2, x21
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x21, x21, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 496
for x in 0..cbw4 {
    cmp x16, x21
// /home/khyber/work/rav1d/src/lf_mask.rs : 500
*u16::mut_from(lvl).unwrap() = filter_level_uv;
    strh w15, [x2, #2]
// /home/khyber/work/rav1d/src/lf_mask.rs : 496
for x in 0..cbw4 {
    b.ne .LBB440_76
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x5, x4
// /home/khyber/work/rav1d/src/lf_mask.rs : 495
for _y in 0..cbh4 {
    add x17, x17, x18
```
Hi,
Heard about the perf contest. I noticed that the create lf mask inter function seems to be slower in rav1d than in dav1d and also generates poorer assembly code. Here's my take at an improvement; it isn't quite at parity with dav1d, but it seems to be significantly better now.
Assuming tests pass, I'll add some more documentation to this PR later.


This speeds up create lf mask significantly (and consistently), but the total improvement in runtime is small: slightly less than 1%, which is within the measurement noise on my machine (Ryzen 3700X), so I'm having trouble benchmarking it reliably. That said, I'm pretty confident it's a net benefit, so I'm posting this; if someone can reproduce the performance improvement with lower noise, that would be great.
I pulled in bytemuck to avoid an unsafe transmute; if you'd prefer not to have the dependency, I can change it.