Enable four block aggregated GCM Intel PCLMUL implementation on i386
author Jussi Kivilinna <jussi.kivilinna@iki.fi>
Sat, 27 Apr 2019 18:38:00 +0000 (21:38 +0300)
committer Jussi Kivilinna <jussi.kivilinna@iki.fi>
Sat, 27 Apr 2019 19:04:37 +0000 (22:04 +0300)
commit a6e7c411e5f67a9473675ca8d49017a4d13a8d3e
tree 90d14aa094cd86007664538a0473ce44f942b24b
parent 1374254c2904ab5b18ba4a890856824a102d4705

* cipher/cipher-gcm-intel-pclmul.c (reduction): Change "%%xmm7" to
"%%xmm5".
(gfmul_pclmul_aggr4): Move outside [__x86_64__] block; Remove usage of
XMM8-XMM15 registers; Do not preload H-values and be_mask to reduce
register usage for i386.
(_gcry_ghash_setup_intel_pclmul): Enable calculation of H2, H3 and H4
on i386.
(_gcry_ghash_intel_pclmul): Adjust to above gfmul_pclmul_aggr4
changes; Move 'aggr4' code path outside [__x86_64__] block.
--

Benchmark on Intel Haswell (win32):

Before:
                    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 GMAC_AES           |     0.446 ns/B      2140 MiB/s      1.78 c/B      3998

After (~2.38x faster):
                    |  nanosecs/byte   mebibytes/sec   cycles/byte  auto Mhz
 GMAC_AES           |     0.187 ns/B      5107 MiB/s     0.747 c/B      3998

Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
cipher/cipher-gcm-intel-pclmul.c