AES-CTR: 0.9.0-rc.2 is slower on AVX2-only CPUs than 0.8.4 #515

@starius

I built a benchmark tool that measures the throughput of AES-CTR on an 8k buffer using various versions of the aes crate.
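For readers who want a feel for how such a measurement can be structured, here is a minimal std-only sketch. It is not the benchmark tool from this issue: the names `Summary`, `summarize` and `throughput_mib_s` are hypothetical, the workload is a plain XOR stand-in rather than AES-CTR, and "8k" is assumed to mean 8192 bytes. It only shows how per-run throughput samples can be reduced to the Avg / Median / Min / Max figures reported below.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Summary statistics over per-run throughput samples, in MiB/s.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Summary {
    avg: f64,
    median: f64,
    min: f64,
    max: f64,
}

/// Reduces throughput samples to average, median, minimum and maximum.
/// Returns `None` for an empty slice.
fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(Summary {
        avg: sorted.iter().sum::<f64>() / n as f64,
        median,
        min: sorted[0],
        max: sorted[n - 1],
    })
}

/// Times `iters` passes of `apply` over `buf` and returns the throughput in MiB/s.
fn throughput_mib_s<F: FnMut(&mut [u8])>(buf: &mut [u8], iters: usize, mut apply: F) -> f64 {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from eliding the work on the buffer.
        apply(black_box(&mut *buf));
    }
    let secs = start.elapsed().as_secs_f64();
    let bytes = (buf.len() * iters) as f64;
    bytes / (1024.0 * 1024.0) / secs
}

fn main() {
    // Assumption: "8k" means 8192 bytes.
    let mut buf = vec![0u8; 8192];
    // Stand-in workload (NOT AES-CTR): XOR every byte with a constant.
    let samples: Vec<f64> = (0..20)
        .map(|_| {
            throughput_mib_s(&mut buf, 10_000, |b| {
                for x in b.iter_mut() {
                    *x ^= 0x5a;
                }
            })
        })
        .collect();
    let s = summarize(&samples).expect("at least one sample");
    println!(
        "Avg: {:.2} MiB/s | Median: {:.2} | Min: {:.2} | Max: {:.2}",
        s.avg, s.median, s.min, s.max
    );
}
```

To measure a real cipher, the closure passed to `throughput_mib_s` would call the cipher's keystream application on the buffer instead of the XOR loop.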

I noticed a significant slowdown between versions 0.8.4 and 0.9.0-rc.2. I think it is related to inlining in autodetect.rs and to the switch from 8 to 9 blocks per run. I drafted a patch here that restores 8 blocks per run and reverts the wrappers in autodetect.rs to the version used in 0.8.4. The VAES code is still there, i.e. the fix is not a breaking change.
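As a small illustration of the 8 vs 9 blocks-per-run difference, the sketch below works out how a buffer divides into full parallel runs and leftover blocks. It assumes "8k" means 8192 bytes and uses a hypothetical helper `split_runs`; it says nothing about how the crate actually handles the leftover blocks, and it does not by itself establish the cause of the slowdown.

```rust
/// AES block size in bytes.
const BLOCK_SIZE: usize = 16;

/// Splits `buf_len` bytes into (full runs of `par_blocks` blocks, leftover blocks).
/// Hypothetical helper for illustration; not part of the aes or ctr crates.
fn split_runs(buf_len: usize, par_blocks: usize) -> (usize, usize) {
    let blocks = buf_len / BLOCK_SIZE;
    (blocks / par_blocks, blocks % par_blocks)
}

fn main() {
    // Assumption: the benchmark buffer is 8192 bytes, i.e. 512 AES blocks.
    for par_blocks in [8, 9] {
        let (runs, leftover) = split_runs(8192, par_blocks);
        println!("{par_blocks} blocks per run: {runs} full runs, {leftover} leftover blocks");
    }
}
```

Under that assumption, 8 blocks per run divides the buffer evenly, while 9 blocks per run leaves 8 blocks outside the full runs.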

Below are performance numbers on two machines.

One machine (Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz):

0.8.4:      Avg: 3373.41 MiB/s | Median: 3478.85 | Min: 2683.36 | Max: 3677.39
0.9.0-rc.2: Avg: 2338.59 MiB/s | Median: 2393.62 | Min: 2066.86 | Max: 2459.26
fix:        Avg: 3598.64 MiB/s | Median: 3713.41 | Min: 2730.50 | Max: 3864.82

Another machine (AMD EPYC-Milan Processor):

0.8.4:      Avg: 7637.36 MiB/s | Median: 8301.11 | Min: 3398.54 | Max: 8330.20
0.9.0-rc.2: Avg: 4451.80 MiB/s | Median: 4979.17 | Min: 2435.00 | Max: 4986.76
fix:        Avg: 7601.95 MiB/s | Median: 8267.81 | Min: 3375.63 | Max: 8278.00
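To put the averages above in relative terms, here is a short sketch that computes the percentage change against 0.8.4. The helper name `relative_change_pct` is hypothetical; the inputs are the Avg values from the two tables. Worked through by hand, 0.9.0-rc.2 comes out roughly 31% lower on the i7-7820HQ and roughly 42% lower on the EPYC-Milan.

```rust
/// Percentage change of `new` relative to `baseline` (negative means slower).
fn relative_change_pct(baseline: f64, new: f64) -> f64 {
    (new / baseline - 1.0) * 100.0
}

fn main() {
    // (machine, 0.8.4 avg, 0.9.0-rc.2 avg, fix avg) in MiB/s, from the tables above.
    let rows = [
        ("i7-7820HQ", 3373.41, 2338.59, 3598.64),
        ("EPYC-Milan", 7637.36, 4451.80, 7601.95),
    ];
    for (machine, old, rc, fix) in rows {
        println!(
            "{}: 0.9.0-rc.2 {:+.1}% | fix {:+.1}% (vs 0.8.4)",
            machine,
            relative_change_pct(old, rc),
            relative_change_pct(old, fix),
        );
    }
}
```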

It was built with cargo build --release in all cases.

To reproduce this, run the following commits of my benchmark tool:

The only difference between them is the versions of aes and ctr used.

I'm attaching the flamegraph generated for version 0.9.0-rc.2. It shows that 24% of the time is spent in <cipher::stream::wrapper::StreamCipherCoreWrapper<T> as cipher::stream::StreamCipher>::try_apply_keystream_inout

[Image: flamegraph for 0.9.0-rc.2]
