crypto: x86/chacha20 - Add a 4-block AVX2 variant

This variant builds upon the idea of the 2-block AVX2 variant that
shuffles words after each round. The shuffling has a rather high latency,
so the arithmetic units are not optimally used.

Given that we have plenty of registers in AVX, this version parallelizes
the 2-block variant to do four blocks. While the first two blocks are
shuffling, the CPU can do the XORing on the second two blocks and
vice-versa, which makes this version much faster than the SSSE3 variant
for four blocks. The latter is now mostly for systems that do not have
AVX2, but there it is the work-horse, so we keep it in place.

The partial XORing function trailer is very similar to the AVX2 2-block
variant. While it could be shared, that code segment is rather short;
profiling is also easier with the trailer integrated, so we keep it per
function.

Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
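For readers unfamiliar with the "shuffle words after each round" idea, the following scalar C sketch may help. It is not the assembly from this patch: it treats the 16-word state as four rows, runs the quarter round down the columns, then rotates rows 1 to 3 so that the former diagonals line up as columns. The function names (rotate_row, double_round_shuffled, double_round_reference) are made up for illustration. In the AVX2 code each row would sit in a vector register and the rotation would be a word shuffle; the 4-block variant interleaves two such independent streams so one can do arithmetic while the other shuffles.

```c
#include <stdint.h>
#include <string.h>

static uint32_t rotl32(uint32_t v, int n)
{
	return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter round on four words. */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	*a += *b; *d ^= *a; *d = rotl32(*d, 16);
	*c += *d; *b ^= *c; *b = rotl32(*b, 12);
	*a += *b; *d ^= *a; *d = rotl32(*d, 8);
	*c += *d; *b ^= *c; *b = rotl32(*b, 7);
}

/* Apply the quarter round to each of the four columns of the 4x4 state. */
static void column_round(uint32_t x[16])
{
	for (int i = 0; i < 4; i++)
		quarter_round(&x[i], &x[4 + i], &x[8 + i], &x[12 + i]);
}

/* Rotate a 4-word row left by n words (stand-in for a vector word shuffle). */
static void rotate_row(uint32_t row[4], int n)
{
	uint32_t t[4];

	for (int i = 0; i < 4; i++)
		t[i] = row[(i + n) & 3];
	memcpy(row, t, sizeof(t));
}

/* Double round done as: columns, shuffle, columns again, shuffle back. */
static void double_round_shuffled(uint32_t x[16])
{
	column_round(x);
	rotate_row(&x[4], 1);
	rotate_row(&x[8], 2);
	rotate_row(&x[12], 3);
	column_round(x);	/* now operates on the former diagonals */
	rotate_row(&x[4], 3);
	rotate_row(&x[8], 2);
	rotate_row(&x[12], 1);
}

/* Textbook double round with explicit diagonal indices, for comparison. */
static void double_round_reference(uint32_t x[16])
{
	column_round(x);
	quarter_round(&x[0], &x[5], &x[10], &x[15]);
	quarter_round(&x[1], &x[6], &x[11], &x[12]);
	quarter_round(&x[2], &x[7], &x[8], &x[13]);
	quarter_round(&x[3], &x[4], &x[9], &x[14]);
}
```

Both functions should produce identical state for any input. The shuffled form is the one that maps naturally onto row-per-register SIMD code, at the cost of the shuffle latency the commit message mentions.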
author: Martin Willi <martin@strongswan.org> 2018-11-11 10:36:30 +0100
committer: Herbert Xu <herbert@gondor.apana.org.au> 2018-11-16 14:11:04 +0800
commit: 8a5a79d5556b822143b4403fc46068d4eef2e4e2 (patch)
tree: bbbc738cd20c7f2a461b26b3be6fc3abde942710 /arch/x86/crypto/chacha20_glue.c
parent: crypto: x86/chacha20 - Add a 2-block AVX2 variant (diff)
download: linux-dev-8a5a79d5556b822143b4403fc46068d4eef2e4e2.tar.xz linux-dev-8a5a79d5556b822143b4403fc46068d4eef2e4e2.zip
1 files changed, 7 insertions, 0 deletions
diff --git a/arch/x86/crypto/chacha20_glue.c b/arch/x86/crypto/chacha20_glue.c
index 82e46589a189..9fd84fe6ec09 100644
--- a/arch/x86/crypto/chacha20_glue.c
+++ b/arch/x86/crypto/chacha20_glue.c
@@ -26,6 +26,8 @@ asmlinkage void chacha20_4block_xor_ssse3(u32 *state, u8 *dst, const u8 *src,
 #ifdef CONFIG_AS_AVX2
 asmlinkage void chacha20_2block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
 					 unsigned int len);
+asmlinkage void chacha20_4block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
+					 unsigned int len);
 asmlinkage void chacha20_8block_xor_avx2(u32 *state, u8 *dst, const u8 *src,
 					 unsigned int len);
 static bool chacha20_use_avx2;
@@ -54,6 +56,11 @@ static void chacha20_dosimd(u32 *state, u8 *dst, const u8 *src,
 			state[12] += chacha20_advance(bytes, 8);
 			return;
 		}
+		if (bytes > CHACHA20_BLOCK_SIZE * 2) {
+			chacha20_4block_xor_avx2(state, dst, src, bytes);
+			state[12] += chacha20_advance(bytes, 4);
+			return;
+		}
 		if (bytes > CHACHA20_BLOCK_SIZE) {
 			chacha20_2block_xor_avx2(state, dst, src, bytes);
 			state[12] += chacha20_advance(bytes, 2);
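
To make the effect of the new branch concrete, here is a small user-space sketch of the tail dispatch. It is not kernel code. Two things are assumptions not visible in the hunk above: that chacha20_advance() returns the number of blocks consumed, capped at its second argument, and that the preceding 8-block branch triggers for more than four blocks. avx2_variant_for() is a hypothetical helper that only reports which routine width would be chosen.

```c
#define CHACHA20_BLOCK_SIZE 64

/*
 * Assumed behaviour of the kernel helper: number of blocks touched by
 * `len` bytes, rounded up, but never more than `maxblocks`.
 */
static unsigned int chacha20_advance(unsigned int len, unsigned int maxblocks)
{
	unsigned int cap = maxblocks * CHACHA20_BLOCK_SIZE;

	if (len > cap)
		len = cap;
	return (len + CHACHA20_BLOCK_SIZE - 1) / CHACHA20_BLOCK_SIZE;
}

/*
 * Hypothetical helper: block width of the AVX2 routine picked for a tail
 * of `bytes` bytes. Returns 1 when none of the multi-block branches
 * match and a single-block path would have to handle the data.
 */
static unsigned int avx2_variant_for(unsigned int bytes)
{
	if (bytes > CHACHA20_BLOCK_SIZE * 4)	/* assumed 8-block threshold */
		return 8;
	if (bytes > CHACHA20_BLOCK_SIZE * 2)	/* branch added by this patch */
		return 4;
	if (bytes > CHACHA20_BLOCK_SIZE)
		return 2;
	return 1;
}
```

Under these assumptions, a 129-byte tail now goes to the 4-block routine and advances the block counter in state[12] by 3, while a tail of exactly 128 bytes still takes the 2-block routine.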