@coderkalyan commented on May 19, 2025:

Hi,

I heard about the perf contest and noticed that the `create_lf_mask_inter` function seems to be slower in rav1d than in dav1d and also generates poorer assembly code. Here's my attempt at an improvement; it isn't quite at parity with dav1d, but it seems significantly better now.

Assuming tests pass, I'll upload some more documentation to this PR later.
[two screenshots attached]

This speeds up `create_lf_mask_inter` significantly (and consistently), but the total improvement in time is small: slightly less than 1%, which is within the measurement noise on my machine (Ryzen 3700X), so I'm having trouble benchmarking it reliably. That said, I'm fairly confident it's a net benefit, so I'm posting this; if someone can reproduce the performance improvement with lower noise, that would be great.

I pulled in bytemuck to avoid an unsafe transmute; if you'd prefer not to have the dependency, I can change it.

Pulls the `level_cache.index_mut` slice out of the inner (x) loop
since accesses are contiguous from `0..bw4`. This reduces overhead
and allows the compiler to unroll the inner loop 4 times.

Also annotates the slice used to access the level cache as a `&mut [[u8; 4]]`.
This should aid unrolling and allow the compiler to speculatively load
both elements, avoiding the unnecessary bounds check each time.

```rust
use crate::src::levels::TxfmSize;
use crate::src::relaxed_atomic::RelaxedAtomic;
use crate::src::tables::dav1d_txfm_dimensions;
use bytemuck;
```

A collaborator commented on the new imports:

We're already using zerocopy, so if you can do these operations with zerocopy instead of bytemuck that would be much preferred.

@coderkalyan (author) replied on May 19, 2025:

Let me see if zerocopy can do the same thing. If not, it's also a "relatively safe" unsafe `mem::transmute` call.
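
For reference, here is a minimal sketch of the cast in question, using bytemuck's checked conversion rather than a raw `mem::transmute` (the buffer and helper here are stand-ins for illustration, not rav1d's actual `level_cache`; zerocopy has comparable checked slice conversions that would need checking against the version pinned in the repo):

```rust
// Reinterpret a flat byte buffer as rows of 4 bytes without `unsafe`.
// `bytemuck::cast_slice_mut` checks the length at runtime and panics if it
// isn't a multiple of 4; `try_cast_slice_mut` returns a Result instead.
fn as_rows_of_4(bytes: &mut [u8]) -> &mut [[u8; 4]] {
    bytemuck::cast_slice_mut(bytes)
}

fn main() {
    let mut buf = vec![0u8; 8];
    let rows = as_rows_of_4(&mut buf);
    rows[1] = [1, 2, 3, 4];
    assert_eq!(buf, [0, 0, 0, 0, 1, 2, 3, 4]);
}
```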

@kkysen (collaborator) left a comment:

I think what's happening here is that this is faster, but it is creating overlapping mutable slices (see the debug panics in CI). If we can figure out a way to do this without creating those overlapping slices, perhaps with some more unsafe if we have to, that should be good. I'll take a closer look, but I just wanted to give you my initial thoughts.

Thanks for finding this, though!

@kkysen changed the title from "create_lf_mask_inter: pull level_cache slice out of inner loop" to "fn create_lf_mask_inter: pull level_cache slice out of inner loop" on May 19, 2025.
@coderkalyan (author) commented on May 19, 2025:

I'm somewhat surprised that it creates overlapping references, since the accesses are contiguous; most likely it's between the two loops in the function. I'll look into this a bit more.

My other question is why exactly taking separate references causes overhead. My guess is that it's difficult for the compiler to speculatively load past them, which gets in the way of unrolling and eliding the bounds checks.

If it doesn't work out we can ignore it; it's not a huge improvement, but every bit counts when you're chasing 5% :) I'll keep chipping away at things in my spare time.
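
For illustration, here is a minimal sketch of the pattern described in the commit message above (all names are invented for the example, not the actual rav1d code): take the row slice out of the level cache once per `y` iteration and type it as `[u8; 4]` entries, so there is one bounds check per row and the inner loop becomes easy for LLVM to unroll.

```rust
/// Hypothetical stand-in for the hot loop: `level_cache` is a flat buffer of
/// 4-byte entries and `stride` is the row pitch, in entries.
fn fill_levels(
    level_cache: &mut [[u8; 4]],
    stride: usize,
    bw4: usize,
    bh4: usize,
    level0: u8,
    level1: u8,
) {
    let mut offset = 0;
    for _y in 0..bh4 {
        // One slice (and one bounds check) per row, instead of an
        // `index_mut` call per element inside the x loop.
        let lvl = &mut level_cache[offset..offset + bw4];
        for x in 0..bw4 {
            lvl[x][0] = level0;
            lvl[x][1] = level1;
        }
        offset += stride;
    }
}
```

Because `lvl.len()` is exactly `bw4`, the `lvl[x]` accesses are provably in bounds, which is what lets the per-element bounds checks disappear and the loop unroll.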

Comment on lines +461 to +462

```rust
lvl[x][0] = filter_level[0][0][0];
lvl[x][1] = filter_level[1][0][0];
```

@iximeow commented:

i was also looking at this function yesterday; you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89).

the factor by which the loop is unrolled is partially a question of the size of the loop, so where this produces pairs of 5-byte stores for each iteration:

```asm
   f10e9: e9 72 ff ff ff        jmp    f1060
   f10ee: 66 90                 xchg   %ax,%ax
   f10f0: 4a 8d 04 32           lea    (%rdx,%r14,1),%rax
   f10f4: 45 31 ff              xor    %r15d,%r15d
   f10f7: 66 0f 1f 84 00 00 00  nopw   0x0(%rax,%rax,1)
   f10fe: 00 00
   f1100: 42 88 74 b8 f3        mov    %sil,-0xd(%rax,%r15,4)
   f1105: 42 88 5c b8 f4        mov    %bl,-0xc(%rax,%r15,4)
   f110a: 42 88 74 b8 f7        mov    %sil,-0x9(%rax,%r15,4)
   f110f: 42 88 5c b8 f8        mov    %bl,-0x8(%rax,%r15,4)
   f1114: 42 88 74 b8 fb        mov    %sil,-0x5(%rax,%r15,4)
   f1119: 42 88 5c b8 fc        mov    %bl,-0x4(%rax,%r15,4)
   f111e: 42 88 74 b8 ff        mov    %sil,-0x1(%rax,%r15,4)
   f1123: 42 88 1c b8           mov    %bl,(%rax,%r15,4)
   f1127: 49 83 c7 04           add    $0x4,%r15
   f112b: 4d 39 fb              cmp    %r15,%r11
   f112e: 75 d0                 jne    f1100
```

i'd seen slightly better throughput still with the iteration being expressed as one 4-byte store:

```asm
   f10f9: 72 2e                 jb     f1129 <_ZN5rav1d3src7lf_mask26rav1d_create_lf_mask_inter17h52008c9dbd379820E+0x1b9>
   f10fb: 4d 89 cb              mov    %r9,%r11
   f10fe: 66 90                 xchg   %ax,%ax
   f1100: 66 89 0f              mov    %cx,(%rdi)
   f1103: 66 89 4f 04           mov    %cx,0x4(%rdi)
   f1107: 66 89 4f 08           mov    %cx,0x8(%rdi)
   f110b: 66 89 4f 0c           mov    %cx,0xc(%rdi)
   f110f: 66 89 4f 10           mov    %cx,0x10(%rdi)
   f1113: 66 89 4f 14           mov    %cx,0x14(%rdi)
   f1117: 66 89 4f 18           mov    %cx,0x18(%rdi)
   f111b: 66 89 4f 1c           mov    %cx,0x1c(%rdi)
   f111f: 48 83 c7 20           add    $0x20,%rdi
   f1123: 49 83 c3 f8           add    $0xfffffffffffffff8,%r11
   f1127: 75 d7                 jne    f1100
```

that said, i'd initially taken an approach of re-slicing from lvl_buf in the innermost loop instead of incrementing x; that worked well too, except it kept maintaining the slice length (an additional sub each loop iteration), where your approach here doesn't do that. it is unfortunate that you get the slightly wider `mov %reg, -0xd(%base,%index,4)` stores though.

i also had a hard time precisely measuring this one on a 9950x, but my change was an improvement on the order of 0.5% for the 10-bit video. on the 8-bit video, where i have better numbers: ~1% fewer loads/stores dispatched (presumably mostly because of the loop unrolling), 0.6% fewer instructions in total, and 0.25% fewer cycles in total.

i would expect your change to have similar results, maybe a slightly smaller change overall because of the less aggressive loop unrolling.

@coderkalyan (author) replied:

Thanks for the help! I hadn't noticed that; let me try to convince LLVM to fuse the stores, similar to what you're doing. Your numbers sound right: they're exactly what I would estimate based on the performance improvement to this function alone. Let me look at instruction/cycle counts; they may be more useful than wall clock, as there's just too much noise to measure < 1% improvements consistently.

If you'd like to share thoughts/work together on completing this fix, let me know!

@coderkalyan (author) commented on May 19, 2025:

On 10-bit Chimera for 4000 frames, I'm getting 1.8% fewer instructions and 0.5-1.2% lower wall-clock time compared to baseline now. The former seems reproducible and consistent; the latter is still affected by noise. That seems better than your results, which is suspicious, so I'll take a closer look.

@iximeow replied:

i think those observations make sense for the 10-bit video, and in particular on a 16-core processor. i noticed this weekend that dav1d and rav1d both struggle to keep threads busy past ~12 threads, and for some inputs, going past 18 threads starts to harm throughput. haven't looked too much into that since the overall behavior is very consistent between the two, but it definitely seems like the overall work-scheduling situation could be... much better.

@kkysen (collaborator) replied:

> i was also looking at this function yesterday; you might want to keep an eye out for ways to convince llvm/rustc that this store can be one 16-bit store instead of a pair of 8-bit stores (i took a "get out and push" approach while proving it out, myself: iximeow@e133a89)

I think I found a way to do this more simply and safely. I'll open a PR for just that part. Thanks for finding that the two `u8` stores aren't being coalesced! I haven't looked at the rest of the loop in detail yet, but I should be able to fix the stores.
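
For reference, a minimal sketch of that idea (the helper and its signature are invented for illustration; the real change is the `u16::mut_from` line visible in the commit below): writing the two adjacent bytes through a `u16` lets LLVM emit a single 16-bit store, which is where the `strh` in the "After" listing comes from.

```rust
use zerocopy::FromBytes;

// `lvl` must be exactly two bytes long and 2-byte aligned, otherwise
// `mut_from` fails and the unwrap panics. `from_ne_bytes` keeps the bytes
// in the same memory order as two separate u8 stores would produce.
fn store_level_pair(lvl: &mut [u8], a: u8, b: u8) {
    *u16::mut_from(lvl).unwrap() = u16::from_ne_bytes([a, b]);
}
```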

kkysen added a commit that referenced this pull request on Jun 16, 2025:
Previously, two adjacent 8-bit stores were done. By doing them as `u16`s
instead of `[u8]`s/`[u8; 2]`s, we can get a single 16-bit store to be
emitted instead.

This gets `strh`s emitted instead of the pair of `strb`s previously.

Based on what @iximeow found in
iximeow@e133a89
and
#1401 (comment).

Before:

```asm
❯ cargo asm -p rav1d --lib --rust rav1d_create_lf_mask_intra --color | rg filter_level -C 10
                if start <= end && end <= len {
        cmp x13, x12
        b.hi .LBB440_79
                // /home/khyber/work/rav1d/src/lf_mask.rs : 458
                let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
        ldr x12, [x1, #8]
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        subs x7, x7, #1
                // /home/khyber/work/rav1d/src/lf_mask.rs : 459
                lvl[0] = filter_level[0][0][0];
        add x12, x12, x11
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        add x11, x11, #4
                // /home/khyber/work/rav1d/src/lf_mask.rs : 459
                lvl[0] = filter_level[0][0][0];
        strb w15, [x12]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 460
                lvl[1] = filter_level[1][0][0];
        strb w16, [x12, #1]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 455
                for x in 0..bw4 {
        b.ne .LBB440_4
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        cmp x6, x20
                // /home/khyber/work/rav1d/src/lf_mask.rs : 454
                for _y in 0..bh4 {
        add x17, x17, x18
--
                // /home/khyber/work/rav1d/src/disjoint_mut.rs : 729
                if start <= end && end <= len {
        add x20, x3, x19
        add x21, x20, #3
        cmp x21, x2
        b.hs .LBB440_80
                // /home/khyber/work/rav1d/src/lf_mask.rs : 496
                let lvl = &mut *level_cache.index_mut((idx + 2.., ..2));
        ldr x2, [x1, #8]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 497
                lvl[0] = filter_level[2][0][0];
        add x2, x2, x3
        add x2, x2, x19
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        add x19, x19, #4
                // /home/khyber/work/rav1d/src/lf_mask.rs : 493
                for x in 0..cbw4 {
        cmp x18, x19
                // /home/khyber/work/rav1d/src/lf_mask.rs : 497
                lvl[0] = filter_level[2][0][0];
        strb w16, [x2, #2]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 498
                lvl[1] = filter_level[3][0][0];
        strb w17, [x2, #3]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 493
                for x in 0..cbw4 {
        b.ne .LBB440_72
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        cmp x6, x4
                // /home/khyber/work/rav1d/src/lf_mask.rs : 492
                for _y in 0..cbh4 {
        add x3, x3, x5
```

After:

```asm
❯ cargo asm -p rav1d --lib --rust rav1d_create_lf_mask_intra --color | rg filter_level -C 10
        ldr x24, [sp, #224]
        ldrb w10, [sp, #184]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 447
                let b_dim = b_dim.map(|it| it as usize);
        ldr w8, [x9, x8, lsl #2]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 448
                let bw4 = cmp::min(iw - bx, b_dim[0]);
        sub x9, x25, x28
        ldrb w6, [sp, #176]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 453
                let filter_level_yuv = filter_level.0.map(|a| a[0][0]);
        ldrb w23, [x3, #32]
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/array/mod.rs : 160
                }
        ldrb w14, [x3, #48]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 450
                let bx4 = bx & 31;
        and x30, x28, #0x1f
                // /home/khyber/work/rav1d/src/lf_mask.rs : 447
                let b_dim = b_dim.map(|it| it as usize);
        and x26, x8, #0xff
--
                let lvl = &mut *level_cache.index_mut((idx + 0.., ..2));
        ldr x12, [x1, #8]
        add w13, w11, w12
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/option.rs : 1001
                match self {
        tbnz w13, #0, .LBB440_21
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        subs x3, x3, #1
                // /home/khyber/work/rav1d/src/lf_mask.rs : 463
                *u16::mut_from(lvl).unwrap() = filter_level_y;
        strh w15, [x12, x11]
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        add x11, x11, #4
                // /home/khyber/work/rav1d/src/lf_mask.rs : 459
                for x in 0..bw4 {
        b.ne .LBB440_4
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        cmp x18, x20
--
        tbnz w19, #0, .LBB440_82
        add x2, x2, x17
        add x2, x2, x21
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        add x21, x21, #4
                // /home/khyber/work/rav1d/src/lf_mask.rs : 496
                for x in 0..cbw4 {
        cmp x16, x21
                // /home/khyber/work/rav1d/src/lf_mask.rs : 500
                *u16::mut_from(lvl).unwrap() = filter_level_uv;
        strh w15, [x2, #2]
                // /home/khyber/work/rav1d/src/lf_mask.rs : 496
                for x in 0..cbw4 {
        b.ne .LBB440_76
                // /home/khyber/.rustup/toolchains/nightly-2025-05-01-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/cmp.rs : 1849
                fn lt(&self, other: &Self) -> bool { *self <  *other }
        cmp x5, x4
                // /home/khyber/work/rav1d/src/lf_mask.rs : 495
                for _y in 0..cbh4 {
        add x17, x17, x18
```