-
Notifications
You must be signed in to change notification settings - Fork 61
perf: fn rav1d_prepare_intra_edges: pre-slice
#1415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
perf: fn rav1d_prepare_intra_edges: pre-slice
#1415
Conversation
rework `rav1d_prepare_intra_edges` to precalculate and take a slice of the data to be iterated over so we only incur the overhead of taking the `DisjointImmutGuard` once, which also allows us to safely use the unsafe `unsafe fn get_unchecked(..)` within the loop avoiding the overhead of any runtime bounds checks, since the `slice()` opertaion already implcitly bounds checked the slice we're asking for. inital hyperfine results show a overall improvement (albeit fairly minisucle): ``` Benchmark 1: ./target/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null Time (mean ± σ): 9.384 s ± 0.057 s [User: 108.379 s, System: 6.500 s] Range (min … max): 9.342 s … 9.516 s 10 runs Benchmark 1: ./target.main/release/dav1d -q -i test-data/Chimera-AV1-10bit-1920x1080-6191kbps.ivf -o /dev/null Time (mean ± σ): 9.490 s ± 0.076 s [User: 110.062 s, System: 6.521 s] Range (min … max): 9.383 s … 9.606 s 10 runs ``` (changed version took ~98.88% of the original runtime)
|
Okay, looks like I missed running the test script with some flags that CI uses, so I'll take a look at that... However, this errror seems fairly concerning for the general concept of hoisting the borrow out of the loop: It seems like maybe there is no getting around taking a Although that makes me wonder if there is an intentional mechanism keeping those Otherwise, couldn't we still run into the same issue for the |
fix tests when running with `-n` flag run `cargo fmt`
|
Okay, updated to ensure tests pass while running the following locally: cargo build --release
rm -rf build
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress \
-n
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress \
-f 1
./.github/workflows/test.sh \
-r target/release/dav1d \
-s target/release/seek_stress \
-f 2However, Im not seeing the overlap issue locally, so that one is still rather concerning. |
kkysen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cargo build --release rm -rf build ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress \ -n ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress \ -f 1 ./.github/workflows/test.sh \ -r target/release/dav1d \ -s target/release/seek_stress \ -f 2However, Im not seeing the overlap issue locally, so that one is still rather concerning.
The overlap checks are only run with debug_assertions on (they should really be on always, but they are very bad for perf, and we don't know a better way to implement it), so you need to test with the opt-dev or checked-release profiles.
| 0 | ||
| }; | ||
| let dst_slice = | ||
| &*(dst + slice_offset - 1isize).slice::<BD>((px_have - 1) * abs_stride + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The slice here overlaps ones from other threads, as the individual indices accessed are interleaved. That's why .index::<BD>() works, but not a single .slice::<BD>(). It's unfortunately a very tricky problem.
fn rav1d_prepare_intra_edges: pre-slice
fn prepare_intra_edgesis slower in rav1d than dav1d #1394.rework
rav1d_prepare_intra_edgesto precalculate and take a slice of the data to be iterated over so we only incur the overhead of taking theDisjointImmutGuardonce, which also allows us to safely use the unsafeunsafe fn get_unchecked(..)within the loop avoiding the overhead of any runtime bounds checks, since theslice()opertaion already implcitly bounds checked the slice we're asking for.inital hyperfine results show a overall improvement (albeit fairly minisucle):
(changed version took ~98.88% of the original runtime)