fn create_lf_mask_inter: pull level_cache slice out of inner loop
#1401
Conversation
Pulls the `level_cache.index_mut` slice out of the inner (`x`) loop, since accesses are contiguous over `0..bw4`. This reduces overhead and allows the compiler to unroll the inner loop 4 times. Also annotates the slice used to access the level cache as a `&mut [[u8; 4]]`. This should aid unrolling and allow the compiler to speculatively load both elements, avoiding the unnecessary bounds check each time.
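The shape of the change is roughly the following self-contained sketch. Plain slices and the helper names `fill_row_before`/`fill_row_after` are stand-ins for illustration; the real code goes through `level_cache.index_mut` and the `filter_level` array, so the exact indexing differs.

```rust
// Before (sketch): re-slice the level cache on every iteration, so each
// iteration pays its own bounds check and LLVM struggles to unroll.
fn fill_row_before(level_cache: &mut [u8], idx: usize, bw4: usize, fl: [u8; 2]) {
    for x in 0..bw4 {
        let lvl = &mut level_cache[idx + x * 4..][..2];
        lvl[0] = fl[0];
        lvl[1] = fl[1];
    }
}

// After (sketch): take one slice for the whole row, view it as [u8; 4] entries,
// and let the inner loop become plain array writes that LLVM can unroll.
fn fill_row_after(level_cache: &mut [u8], idx: usize, bw4: usize, fl: [u8; 2]) {
    let lvl = &mut level_cache[idx..][..bw4 * 4];
    let lvl: &mut [[u8; 4]] = bytemuck::cast_slice_mut(lvl);
    for x in 0..bw4 {
        lvl[x][0] = fl[0];
        lvl[x][1] = fl[1];
    }
}
```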
```rust
use crate::src::levels::TxfmSize;
use crate::src::relaxed_atomic::RelaxedAtomic;
use crate::src::tables::dav1d_txfm_dimensions;
use bytemuck;
```
We're already using zerocopy, so if you can do these operations with zerocopy instead of bytemuck that would be much preferred.
Let me see if zerocopy can do the same thing. If not, it's also a "relatively safe" unsafe `mem::transmute` call.
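Not part of this PR, but for comparison, here's a sketch of the same view taken with zerocopy instead of bytemuck, assuming the zerocopy release in use exposes `FromBytes::mut_slice_from` (0.7 does; newer releases rename these methods):

```rust
use zerocopy::FromBytes;

// View a row of bytes as 4-byte entries, like bytemuck::cast_slice_mut does in
// this PR. Returns None if the length is not a multiple of 4.
fn as_rows(row: &mut [u8]) -> Option<&mut [[u8; 4]]> {
    <[u8; 4]>::mut_slice_from(row)
}
```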
kkysen left a comment
I think what's happening here is that this is faster, but it is creating overlapping mutable slices (see the debug panics in CI). If we can figure out a way to do this without creating those overlapping slices, perhaps with some more unsafe if we have to, that should be good. I'll take a closer look, but I just wanted to give you my initial thoughts.
Thanks for finding this, though!
fn create_lf_mask_inter: pull level_cache slice out of inner loop
I'm somewhat surprised that it creates overlapping references since the accesses are contiguous; most likely it's between the two loops in the function. I'll look into this a bit more. My other question would be why exactly taking separate references causes overhead, but my guess is it's difficult for the compiler to speculatively load past them, which gets in the way of unrolling and eliding the bounds checks. If it doesn't work out we can ignore it; it's not a huge improvement, but every bit counts when you're chasing 5% :) I'll keep chipping away at things in my spare time.
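For readers following along, here's a toy model of the kind of check that's tripping: in debug builds, overlapping mutable slices of the level cache panic. The `BorrowTracker` below is a hypothetical stand-in, not `DisjointMut`'s actual implementation.

```rust
use std::ops::Range;

#[derive(Default)]
struct BorrowTracker {
    active: Vec<Range<usize>>, // byte ranges currently borrowed mutably
}

impl BorrowTracker {
    fn borrow_mut(&mut self, new: Range<usize>) {
        // Two live mutable borrows must not overlap; panic like the CI debug check.
        for old in &self.active {
            assert!(
                new.end <= old.start || old.end <= new.start,
                "overlapping mutable borrows: {:?} and {:?}",
                old,
                new
            );
        }
        self.active.push(new);
    }
}

fn main() {
    let mut t = BorrowTracker::default();
    t.borrow_mut(0..8); // the whole-row slice hoisted out of the inner loop
    t.borrow_mut(4..6); // a later slice over the same bytes -> panics
}
```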
```rust
lvl[x][0] = filter_level[0][0][0];
lvl[x][1] = filter_level[1][0][0];
```
i was also looking at this function yesterday, you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89)
the factor by which the loop is unrolled is partially a question of the size of the loop, so where this produces pairs of 5-byte stores for each iteration:
```asm
f10e9: e9 72 ff ff ff jmp f1060
f10ee: 66 90 xchg %ax,%ax
f10f0: 4a 8d 04 32 lea (%rdx,%r14,1),%rax
f10f4: 45 31 ff xor %r15d,%r15d
f10f7: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
00 00
f1100: 42 88 74 b8 f3 mov %sil,-0xd(%rax,%r15,4)
f1105: 42 88 5c b8 f4 mov %bl,-0xc(%rax,%r15,4)
f110a: 42 88 74 b8 f7 mov %sil,-0x9(%rax,%r15,4)
f110f: 42 88 5c b8 f8 mov %bl,-0x8(%rax,%r15,4)
f1114: 42 88 74 b8 fb mov %sil,-0x5(%rax,%r15,4)
f1119: 42 88 5c b8 fc mov %bl,-0x4(%rax,%r15,4)
f111e: 42 88 74 b8 ff mov %sil,-0x1(%rax,%r15,4)
f1123: 42 88 1c b8 mov %bl,(%rax,%r15,4)
f1127: 49 83 c7 04 add $0x4,%r15
f112b: 4d 39 fb cmp %r15,%r11
f112e: 75 d0 jne f1100
```
i'd seen slightly better throughput still with the iteration being expressed as one 4-byte store:
```asm
f10f9: 72 2e jb f1129 <_ZN5rav1d3src7lf_mask26rav1d_create_lf_mask_inter17h52008c9dbd379820E+0x1b9>
f10fb: 4d 89 cb mov %r9,%r11
f10fe: 66 90 xchg %ax,%ax
f1100: 66 89 0f mov %cx,(%rdi)
f1103: 66 89 4f 04 mov %cx,0x4(%rdi)
f1107: 66 89 4f 08 mov %cx,0x8(%rdi)
f110b: 66 89 4f 0c mov %cx,0xc(%rdi)
f110f: 66 89 4f 10 mov %cx,0x10(%rdi)
f1113: 66 89 4f 14 mov %cx,0x14(%rdi)
f1117: 66 89 4f 18 mov %cx,0x18(%rdi)
f111b: 66 89 4f 1c mov %cx,0x1c(%rdi)
f111f: 48 83 c7 20 add $0x20,%rdi
f1123: 49 83 c3 f8 add $0xfffffffffffffff8,%r11
f1127: 75 d7 jne f1100
```
that said i'd initially taken an approach of re-slicing from lvl_buf in the innermost loop instead of incrementing x - that worked well too, except it kept maintaining the slice length (an additional sub each loop iteration) where your approach here doesn't do that. it is unfortunate that you get the slightly wider `mov %sil,-0xd(%base,%index,4)`-style stores though..
i also had a hard time precisely measuring this one. on a 9950x, my change was an improvement on the order of 0.5% for the 10-bit video. on the 8-bit video, where i've got better numbers: ~1% fewer loads/stores dispatched (presumably mostly because of the loop unrolling), 0.6% fewer instructions in total, 0.25% fewer total cycles.
i would expect your change to have similar results, maybe a smaller change overall because of the slightly less aggressive loop unrolling.
Thanks for the help! I didn't notice that; let me try to convince LLVM to fuse the stores, similar to what you're doing. Your numbers sound right - they're exactly what I would estimate based on the performance improvement to this function alone. Let me look at instruction/cycle counts; they may be more useful than wall clock, as there's just too much noise to measure <1% improvements consistently.
If you'd like to share thoughts/work together on completing this fix, let me know!
On 10-bit Chimera for 4000 frames, I'm getting 1.8% fewer instructions and 0.5-1.2% lower wall clock time compared to baseline now. The former seems to be reproducible/consistent; the latter is still affected by noise. This seems better than yours, which is suspicious, so I'll take a closer look.
i think those observations make sense for the 10-bit video, and in particular on a 16-core processor. i noticed this weekend that dav1d and rav1d both struggle to keep threads busy past ~12 threads and for some inputs more threads past 18 start to harm throughput. haven't looked too much into that since the overall behavior is very consistent between the two, but it definitely seems like the overall work-scheduling situation could be.... much better
> i was also looking at this function yesterday, you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89)
I think I found a way to do this that's simpler and safer. I'll open a PR for just that part. Thanks for finding out that the 2 u8 stores aren't being coalesced! I haven't looked at the rest of the loop in detail yet, but the stores I should be able to fix.
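Here's a minimal sketch of that fix, using the same `u16::mut_from` write that appears in the follow-up change described below; the helper name and the packing step are illustrative rather than the actual rav1d code:

```rust
use zerocopy::FromBytes;

// Write both filter levels for one block with a single 16-bit store (strh on
// aarch64) instead of two 8-bit stores (strb). `lvl` must be exactly 2 bytes
// long and 2-byte aligned, or the unwrap fails.
fn write_levels(lvl: &mut [u8], levels: [u8; 2]) {
    let packed = u16::from_ne_bytes(levels); // pack the two levels native-endian
    *u16::mut_from(lvl).unwrap() = packed;
}
```

Compared to assigning `lvl[0]` and `lvl[1]` separately, this makes the 16-bit store explicit instead of relying on LLVM to merge the two byte stores.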
Previously, two adjacent 8-bit stores were done. By doing them as `u16`s instead of `[u8]`s/`[u8; 2]`s, we can get a single 16-bit store to be emitted instead. This gets `strh`s emitted instead of the pair of `strb`s previously. Based on what @iximeow found in iximeow@e133a89 and #1401 (comment).

Before:

```asm
❯ cargo asm -p rav1d --lib --rust rav1d_create_lf_mask_intra --color | rg filter_level -C 10
if start <= end && end <= len {
    cmp x13, x12
    b.hi .LBB440_79
// /home/khyber/work/rav1d/src/lf_mask.rs : 458
let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
    ldr x12, [x1, #8]
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    subs x7, x7, #1
// /home/khyber/work/rav1d/src/lf_mask.rs : 459
lvl[0] = filter_level[0][0][0];
    add x12, x12, x11
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x11, x11, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 459
lvl[0] = filter_level[0][0][0];
    strb w15, [x12]
// /home/khyber/work/rav1d/src/lf_mask.rs : 460
lvl[1] = filter_level[1][0][0];
    strb w16, [x12, #1]
// /home/khyber/work/rav1d/src/lf_mask.rs : 455
for x in 0..bw4 {
    b.ne .LBB440_4
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x6, x20
// /home/khyber/work/rav1d/src/lf_mask.rs : 454
for _y in 0..bh4 {
    add x17, x17, x18
--
// /home/khyber/work/rav1d/src/disjoint_mut.rs : 729
if start <= end && end <= len {
    add x20, x3, x19
    add x21, x20, #3
    cmp x21, x2
    b.hs .LBB440_80
// /home/khyber/work/rav1d/src/lf_mask.rs : 496
let lvl = &mut *level_cache.index_mut((idx + 2.., ..2));
    ldr x2, [x1, #8]
// /home/khyber/work/rav1d/src/lf_mask.rs : 497
lvl[0] = filter_level[2][0][0];
    add x2, x2, x3
    add x2, x2, x19
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x19, x19, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 493
for x in 0..cbw4 {
    cmp x18, x19
// /home/khyber/work/rav1d/src/lf_mask.rs : 497
lvl[0] = filter_level[2][0][0];
    strb w16, [x2, #2]
// /home/khyber/work/rav1d/src/lf_mask.rs : 498
lvl[1] = filter_level[3][0][0];
    strb w17, [x2, #3]
// /home/khyber/work/rav1d/src/lf_mask.rs : 493
for x in 0..cbw4 {
    b.ne .LBB440_72
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x6, x4
// /home/khyber/work/rav1d/src/lf_mask.rs : 492
for _y in 0..cbh4 {
    add x3, x3, x5
```

After:

```asm
❯ cargo asm -p rav1d --lib --rust rav1d_create_lf_mask_intra --color | rg filter_level -C 10
    ldr x24, [sp, #224]
    ldrb w10, [sp, #184]
// /home/khyber/work/rav1d/src/lf_mask.rs : 447
let b_dim = b_dim.map(|it| it as usize);
    ldr w8, [x9, x8, lsl #2]
// /home/khyber/work/rav1d/src/lf_mask.rs : 448
let bw4 = cmp::min(iw - bx, b_dim[0]);
    sub x9, x25, x28
    ldrb w6, [sp, #176]
// /home/khyber/work/rav1d/src/lf_mask.rs : 453
let filter_level_yuv = filter_level.0.map(|a| a[0][0]);
    ldrb w23, [x3, #32]
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/array/mod.rs : 160
}
    ldrb w14, [x3, #48]
// /home/khyber/work/rav1d/src/lf_mask.rs : 450
let bx4 = bx & 31;
    and x30, x28, #0x1f
// /home/khyber/work/rav1d/src/lf_mask.rs : 447
let b_dim = b_dim.map(|it| it as usize);
    and x26, x8, #0xff
--
let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
    ldr x12, [x1, #8]
    add w13, w11, w12
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs : 1001
match self {
    tbnz w13, #0, .LBB440_21
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    subs x3, x3, #1
// /home/khyber/work/rav1d/src/lf_mask.rs : 463
*u16::mut_from(lvl).unwrap() = filter_level_y;
    strh w15, [x12, x11]
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x11, x11, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 459
for x in 0..bw4 {
    b.ne .LBB440_4
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x18, x20
--
    tbnz w19, #0, .LBB440_82
    add x2, x2, x17
    add x2, x2, x21
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    add x21, x21, #4
// /home/khyber/work/rav1d/src/lf_mask.rs : 496
for x in 0..cbw4 {
    cmp x16, x21
// /home/khyber/work/rav1d/src/lf_mask.rs : 500
*u16::mut_from(lvl).unwrap() = filter_level_uv;
    strh w15, [x2, #2]
// /home/khyber/work/rav1d/src/lf_mask.rs : 496
for x in 0..cbw4 {
    b.ne .LBB440_76
// /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
fn lt(&self, other: &Self) -> bool { *self < *other }
    cmp x5, x4
// /home/khyber/work/rav1d/src/lf_mask.rs : 495
for _y in 0..cbh4 {
    add x17, x17, x18
```
Hi,
Heard about the perf contest. I noticed that the create lf mask inter function seems to be slower in rav1d than in dav1d and also generates poorer assembly code. Here's my take at an improvement; it isn't quite at parity with dav1d, but it seems to be significantly better now.
Assuming tests pass, I'll add some more documentation to this PR later.


This speeds up create lf mask significantly (and consistently), but the total improvement in runtime is small: slightly less than 1%, which is within the measurement noise on my machine (Ryzen 3700X), so I'm having trouble benchmarking it reliably. That said, I'm pretty confident it's a net benefit, so I'm posting this; if someone can reproduce the performance improvement with lower noise, that would be great.
I pulled in bytemuck to avoid an unsafe transmute; if you'd prefer not to have the dependency, I can change it.