Skip to content

Conversation

@ERobsham
Copy link

@ERobsham ERobsham commented May 25, 2025

rework rav1d_prepare_intra_edges to precalculate and take a slice of the data to be iterated over so we only incur the overhead of taking the DisjointImmutGuard once, which also allows us to safely use the unsafe unsafe fn get_unchecked(..) within the loop avoiding the overhead of any runtime bounds checks, since the slice() opertaion already implcitly bounds checked the slice we're asking for.

inital hyperfine results show a overall improvement (albeit fairly minisucle):

Benchmark 1: ./target/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
  Time (mean ± σ):      9.384 s ±  0.057 s    [User: 108.379 s, System: 6.500 s]
  Range (min … max):    9.342 s …  9.516 s    10 runs

Benchmark 1: ./target.main/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
  Time (mean ± σ):      9.490 s ±  0.076 s    [User: 110.062 s, System: 6.521 s]
  Range (min … max):    9.383 s …  9.606 s    10 runs

(changed version took ~98.88% of the original runtime)

rework `rav1d_prepare_intra_edges` to precalculate and take a slice of
the data to be iterated over so we only incur the overhead of taking the
`DisjointImmutGuard` once, which also allows us to safely use the unsafe
`unsafe fn get_unchecked(..)` within the loop avoiding the overhead of
any runtime bounds checks, since the `slice()` opertaion already implcitly
bounds checked the slice we're asking for.

inital hyperfine results show a overall improvement (albeit fairly minisucle):
```
Benchmark 1: ./target/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
  Time (mean ± σ):      9.384 s ±  0.057 s    [User: 108.379 s, System: 6.500 s]
  Range (min … max):    9.342 s …  9.516 s    10 runs

Benchmark 1: ./target.main/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null
  Time (mean ± σ):      9.490 s ±  0.076 s    [User: 110.062 s, System: 6.521 s]
  Range (min … max):    9.383 s …  9.606 s    10 runs
```
(changed version took ~98.88% of the original runtime)
@ERobsham
Copy link
Author

Okay, looks like I missed running the test script with some flags that CI uses, so I'll take a look at that...

However, this errror seems fairly concerning for the general concept of hoisting the borrow out of the loop:

thread 'rav1d-worker-0' panicked at src/disjoint_mut.rs:837:13:
	overlapping DisjointMut:
 current:    & _[734..35296] on ThreadId(2) at src/ipred_prepare.rs:253:22
existing: &mut _[25920..25984] on ThreadId(3) at src/itx.rs:98:33

It seems like maybe there is no getting around taking a DisjointImmutGuard at each iteration?

Although that makes me wonder if there is an intentional mechanism keeping those index() calls from overlapping a current mut borrow that isn't working the same for slice()?

Otherwise, couldn't we still run into the same issue for the index() calls if that other thread is ever allowed to take a mut borrow from the same chunk we're looping over here? (just with a lower likelihood of hitting this case, since its a smaller chunk getting immutably borrowed at once)

fix tests when running with `-n` flag
run `cargo fmt`
@ERobsham
Copy link
Author

ERobsham commented May 25, 2025

Okay, updated to ensure tests pass while running the following locally:

cargo build --release
rm -rf build

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress \
    -n

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress \
    -f 1

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress \
    -f 2

However, Im not seeing the overlap issue locally, so that one is still rather concerning.

Copy link
Collaborator

@kkysen kkysen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cargo build --release
rm -rf build

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress \
    -n

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress \
    -f 1

./.github/workflows/test.sh \
    -r target/release/dav1d \
    -s target/release/seek_stress \
    -f 2

However, Im not seeing the overlap issue locally, so that one is still rather concerning.

The overlap checks are only run with debug_assertions on (they should really be on always, but they are very bad for perf, and we don't know a better way to implement it), so you need to test with the opt-dev or checked-release profiles.

0
};
let dst_slice =
&*(dst + slice_offset - 1isize).slice::<BD>((px_have - 1) * abs_stride + 1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The slice here overlaps ones from other threads, as the individual indices accessed are interleaved. That's why .index::<BD>() works, but not a single .slice::<BD>(). It's unfortunately a very tricky problem.

@kkysen kkysen changed the title perf improvement re: issue #1394 perf: fn rav1d_prepare_intra_edges: pre-slice Jun 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants