aboutsummaryrefslogtreecommitdiffstats
path: root/arch (follow)
AgeCommit message (Collapse)AuthorFilesLines
2021-01-22crypto: aesni - release FPU during skcipher walk API callsArd Biesheuvel1-41/+32
Taking ownership of the FPU in kernel mode disables preemption, and this may result in excessive scheduling blackouts if the size of the data being processed on the FPU is unbounded. Given that taking and releasing the FPU is cheap these days on x86, we can limit the impact of this issue easily for skcipher implementations, by moving the FPU begin/end calls inside the skcipher walk processing loop. Considering that skcipher walks operate on at most one page at a time, doing so fully mitigates this issue. This also permits the skcipher walk logic to use non-atomic kmalloc() calls etc so we can change the 'atomic' bool argument in the calls to skcipher_walk_virt() to false as well. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-22crypto: aesni - replace CTR function pointer with static callArd Biesheuvel1-6/+7
Indirect calls are very expensive on x86, so use a static call to set the system-wide AES-NI/CTR asm helper. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-22crypto: arm64/sha - add missing module aliasesArd Biesheuvel4-0/+9
The accelerated, instruction based implementations of SHA1, SHA2 and SHA3 are autoloaded based on CPU capabilities, given that the code is modest in size, and widely used, which means that resolving the algo name, loading all compatible modules and picking the one with the highest priority is taken to be suboptimal. However, if these algorithms are requested before this CPU feature based matching and autoloading occurs, these modules are not even considered, and we end up with suboptimal performance. So add the missing module aliases for the various SHA implementations. Cc: <stable@vger.kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86 - use local headers for x86 specific shared declarationsArd Biesheuvel12-8/+8
The Camellia, Serpent and Twofish related header files only contain declarations that are shared between different implementations of the respective algorithms residing under arch/x86/crypto, and none of their contents should be used elsewhere. So move the header files into the same location, and use local #includes instead. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86 - remove glue helper moduleArd Biesheuvel3-231/+0
All dependencies on the x86 glue helper module have been replaced by local instantiations of the new ECB/CBC preprocessor helper macros, so the glue helper module can be retired. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/twofish - drop dependency on glue helperArd Biesheuvel2-109/+44
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast6 - drop dependency on glue helperArd Biesheuvel1-44/+17
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast5 - drop dependency on glue helperArd Biesheuvel1-167/+17
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/serpent - drop dependency on glue helperArd Biesheuvel3-154/+61
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/camellia - drop dependency on glue helperArd Biesheuvel3-166/+67
Replace the glue helper dependency with implementations of ECB and CBC based on the new CPP macros, which avoid the need for indirect calls. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86 - add some helper macros for ECB and CBC modesArd Biesheuvel1-0/+76
The x86 glue helper module is starting to show its age: - It relies heavily on function pointers to invoke asm helper functions that operate on fixed input sizes that are relatively small. This means the performance is severely impacted by retpolines. - It goes to great lengths to amortize the cost of kernel_fpu_begin()/end() over as much work as possible, which is no longer necessary now that FPU save/restore is done lazily, and doing so may cause unbounded scheduling blackouts due to the fact that enabling the FPU in kernel mode disables preemption. - The CBC mode decryption helper makes backward strides through the input, in order to avoid a single block size memcpy() between chunks. Consuming the input in this manner is highly likely to defeat any hardware prefetchers, so it is better to go through the data linearly, and perform the extra memcpy() where needed (which is turned into direct loads and stores by the compiler anyway). Note that benchmarks won't show this effect, given that the memory they use is always cache hot. - It implements blockwise XOR in terms of le128 pointers, which imply an alignment that is not guaranteed by the API, violating the C standard. GCC does not seem to be smart enough to elide the indirect calls when the function pointers are passed as arguments to static inline helper routines modeled after the existing ones. So instead, let's create some CPP macros that encapsulate the core of the ECB and CBC processing, so we can wire them up for existing users of the glue helper module, i.e., Camellia, Serpent, Twofish and CAST6. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/blowfish - drop CTR mode implementationArd Biesheuvel1-107/+0
Blowfish in counter mode is never used in the kernel, so there is no point in keeping an accelerated implementation around. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/des - drop CTR mode implementationArd Biesheuvel1-104/+0
DES or Triple DES in counter mode is never used in the kernel, so there is no point in keeping an accelerated implementation around. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/glue-helper - drop CTR helper routinesArd Biesheuvel4-207/+0
The glue helper's CTR routines are no longer used, so drop them. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/twofish - drop CTR mode implementationArd Biesheuvel4-147/+0
Twofish in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast6 - drop CTR mode implementationArd Biesheuvel2-76/+0
CAST6 in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast5 - drop CTR mode implementationArd Biesheuvel1-103/+0
CAST5 in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/serpent - drop CTR mode implementationArd Biesheuvel5-201/+0
Serpent in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/camellia - drop CTR mode implementationArd Biesheuvel6-416/+0
Camellia in CTR mode is never used by the kernel directly, and is highly unlikely to be relied upon by dm-crypt or algif_skcipher. So let's drop the accelerated CTR mode implementation, and instead, rely on the CTR template and the bare cipher. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/glue-helper - drop XTS helper routinesArd Biesheuvel4-303/+0
The glue helper's XTS routines are no longer used, so drop them. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/twofish - switch to XTS templateArd Biesheuvel2-151/+0
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Twofish in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/serpent- switch to XTS templateArd Biesheuvel5-304/+0
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Serpent in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/cast6 - switch to XTS templateArd Biesheuvel2-154/+0
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement CAST6 in XTS mode as well, which turns out to be at least as fast, and sometimes even faster Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: x86/camellia - switch to XTS templateArd Biesheuvel5-576/+1
Now that the XTS template can wrap accelerated ECB modes, it can be used to implement Camellia in XTS mode as well, which turns out to be at least as fast, and sometimes even faster. Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - replace function pointers with static branchesArd Biesheuvel1-44/+54
Replace the function pointers in the GCM implementation with static branches, which are based on code patching, which occurs only at module load time. This avoids the severe performance penalty caused by the use of retpolines. In order to retain the ability to switch between different versions of the implementation based on the input size on cores that support AVX and AVX2, use static branches instead of static calls. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - refactor scatterlist processingArd Biesheuvel1-83/+56
Currently, the gcm(aes-ni) driver open codes the scatterlist handling that is encapsulated by the skcipher walk API. So let's switch to that instead. Also, move the handling at the end of gcmaes_crypt_by_sg() that is dependent on whether we are encrypting or decrypting into the callers, which always do one or the other. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - clean up mapping of associated dataArd Biesheuvel1-4/+5
The gcm(aes-ni) driver is only built for x86_64, which does not make use of highmem. So testing for PageHighMem is pointless and can be omitted. While at it, replace GFP_ATOMIC with the appropriate runtime decided value based on the context. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - drop unused asm prototypesArd Biesheuvel1-67/+0
Drop some prototypes that are declared but never called. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-14crypto: aesni - prevent misaligned buffers on the stackArd Biesheuvel1-12/+16
The GCM mode driver uses 16 byte aligned buffers on the stack to pass the IV to the asm helpers, but unfortunately, the x86 port does not guarantee that the stack pointer is 16 byte aligned upon entry in the first place. Since the compiler is not aware of this, it will not emit the additional stack realignment sequence that is needed, and so the alignment is not guaranteed to be more than 8 bytes. So instead, allocate some padding on the stack, and realign the IV pointer by hand. Cc: <stable@vger.kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08crypto: x86/aes-ni-xts - rewrite and drop indirections via glue helperArd Biesheuvel2-144/+356
The AES-NI driver implements XTS via the glue helper, which consumes a struct with sets of function pointers which are invoked on chunks of input data of the appropriate size, as annotated in the struct. Let's get rid of this indirection, so that we can perform direct calls to the assembler helpers. Instead, let's adopt the arm64 strategy, i.e., provide a helper which can consume inputs of any size, provided that the penultimate, full block is passed via the last call if ciphertext stealing needs to be applied. This also allows us to enable the XTS mode for i386. Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-08crypto: x86/aes-ni-xts - use direct calls to and 4-way strideArd Biesheuvel2-56/+84
The XTS asm helper arrangement is a bit odd: the 8-way stride helper consists of back-to-back calls to the 4-way core transforms, which are called indirectly, based on a boolean that indicates whether we are performing encryption or decryption. Given how costly indirect calls are on x86, let's switch to direct calls, and given how the 8-way stride doesn't really add anything substantial, use a 4-way stride instead, and make the asm core routine deal with any multiple of 4 blocks. Since 512 byte sectors or 4 KB blocks are the typical quantities XTS operates on, increase the stride exported to the glue helper to 512 bytes as well. As a result, the number of indirect calls is reduced from 3 per 64 bytes of in/output to 1 per 512 bytes of in/output, which produces a 65% speedup when operating on 1 KB blocks (measured on a Intel(R) Core(TM) i7-8650U CPU) Fixes: 9697fa39efd3f ("x86/retpoline/crypto: Convert crypto assembler indirect jumps") Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm/blake2b - add NEON-accelerated BLAKE2bEric Biggers4-0/+464
Add a NEON-accelerated implementation of BLAKE2b. On Cortex-A7 (which these days is the most common ARM processor that doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as SHA-256, and slightly faster than SHA-1. It is also almost three times as fast as the generic implementation of BLAKE2b: Algorithm Cycles per byte (on 4096-byte messages) =================== ======================================= blake2b-256-neon 14.0 sha1-neon 16.3 blake2s-256-arm 18.8 sha1-asm 20.8 blake2s-256-generic 26.0 sha256-neon 28.9 sha256-asm 32.0 blake2b-256-generic 38.9 This implementation isn't directly based on any other implementation, but it borrows some ideas from previous NEON code I've written as well as from chacha-neon-core.S. At least on Cortex-A7, it is faster than the other NEON implementations of BLAKE2b I'm aware of (the implementation in the BLAKE2 official repository using intrinsics, and Andrew Moon's implementation which can be found in SUPERCOP). It does only one block at a time, so it performs well on short messages too. NEON-accelerated BLAKE2b is useful because there is interest in using BLAKE2b-256 for dm-verity on low-end Android devices (specifically, devices that lack the ARMv8 Crypto Extensions) to replace SHA-1. On these devices, the performance cost of upgrading to SHA-256 may be unacceptable, whereas BLAKE2b-256 would actually improve performance. Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which is intended for 32-bit platforms), on 32-bit ARM processors with NEON, BLAKE2b is actually faster than BLAKE2s. This is because NEON supports 64-bit operations, and because BLAKE2s's block size is too small for NEON to be helpful for it. The best I've been able to do with BLAKE2s on Cortex-A7 is 18.8 cpb with an optimized scalar implementation. (I didn't try BLAKE2sp and BLAKE3, which in theory would be faster, but they're more complex as they require running multiple hashes at once. Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7, so I expect that any speedup from BLAKE2sp or BLAKE3 would come only from the smaller number of rounds, not from the extra parallelism.) For now this BLAKE2b implementation is only wired up to the shash API, since there is no library API for BLAKE2b yet. However, I've tried to keep things consistent with BLAKE2s, e.g. by defining blake2b_compress_arch() which is analogous to blake2s_compress_arch() and could be exported for use by the library API later if needed. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm/blake2s - add ARM scalar optimized BLAKE2sEric Biggers4-0/+374
Add an ARM scalar optimized implementation of BLAKE2s. NEON isn't very useful for BLAKE2s because the BLAKE2s block size is too small for NEON to help. Each NEON instruction would depend on the previous one, resulting in poor performance. With scalar instructions, on the other hand, we can take advantage of ARM's "free" rotations (like I did in chacha-scalar-core.S) to get an implementation get runs much faster than the C implementation. Performance results on Cortex-A7 in cycles per byte using the shash API: 4096-byte messages: blake2s-256-arm: 18.8 blake2s-256-generic: 26.0 500-byte messages: blake2s-256-arm: 20.3 blake2s-256-generic: 27.9 100-byte messages: blake2s-256-arm: 29.7 blake2s-256-generic: 39.2 32-byte messages: blake2s-256-arm: 50.6 blake2s-256-generic: 66.2 Except on very short messages, this is still slower than the NEON implementation of BLAKE2b which I've written; that is 14.0, 16.4, 25.8, and 76.1 cpb on 4096, 500, 100, and 32-byte messages, respectively. However, optimized BLAKE2s is useful for cases where BLAKE2s is used instead of BLAKE2b, such as WireGuard. This new implementation is added in the form of a new module blake2s-arm.ko, which is analogous to blake2s-x86_64.ko in that it provides blake2s_compress_arch() for use by the library API as well as optionally register the algorithms with the shash API. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Tested-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: blake2s - share the "shash" API boilerplate codeEric Biggers1-67/+7
Add helper functions for shash implementations of BLAKE2s to include/crypto/internal/blake2s.h, taking advantage of __blake2s_update() and __blake2s_final() that were added by the previous patch to share more code between the library and shash implementations. crypto_blake2s_setkey() and crypto_blake2s_init() are usable as shash_alg::setkey and shash_alg::init directly, while crypto_blake2s_update() and crypto_blake2s_final() take an extra 'blake2s_compress_t' function pointer parameter. This allows the implementation of the compression function to be overridden, which is the only part that optimized implementations really care about. The new functions are inline functions (similar to those in sha1_base.h, sha256_base.h, and sm3_base.h) because this avoids needing to add a new module blake2s_helpers.ko, they aren't *too* long, and this avoids indirect calls which are expensive these days. Note that they can't go in blake2s_generic.ko, as that would require selecting CRYPTO_BLAKE2S from CRYPTO_BLAKE2S_X86, which would cause a recursive dependency. Finally, use these new helper functions in the x86 implementation of BLAKE2s. (This part should be a separate patch, but unfortunately the x86 implementation used the exact same function names like "crypto_blake2s_update()", so it had to be updated at the same time.) Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: x86/blake2s - define shash_alg structs using macrosEric Biggers1-61/+23
The shash_alg structs for the four variants of BLAKE2s are identical except for the algorithm name, driver name, and digest size. So, avoid code duplication by using a macro to define these structs. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm64/aes-ctr - improve tail handlingArd Biesheuvel2-74/+137
Counter mode is a stream cipher chaining mode that is typically used with inputs that are of arbitrarily length, and so a tail block which is smaller than a full AES block is rule rather than exception. The current ctr(aes) implementation for arm64 always makes a separate call into the assembler routine to process this tail block, which is suboptimal, given that it requires reloading of the AES round keys, and prevents us from handling this tail block using the 5-way stride that we use for better performance on deep pipelines. So let's update the assembler routine so it can handle any input size, and uses NEON permutation instructions and overlapping loads and stores to handle the tail block. This results in a ~16% speedup for 1420 byte blocks on cores with deep pipelines such as ThunderX2. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm64/aes-ce - really hide slower algos when faster ones are enabledArd Biesheuvel1-2/+2
Commit 69b6f2e817e5b ("crypto: arm64/aes-neon - limit exposed routines if faster driver is enabled") intended to hide modes from the plain NEON driver that are also implemented by the faster bit sliced NEON one if both are enabled. However, the defined() CPP function does not detect if the bit sliced NEON driver is enabled as a module. So instead, let's use IS_ENABLED() here. Fixes: 69b6f2e817e5b ("crypto: arm64/aes-neon - limit exposed routines if ...") Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: remove cipher routines from public crypto APIArd Biesheuvel2-0/+5
The cipher routines in the crypto API are mostly intended for templates implementing skcipher modes generically in software, and shouldn't be used outside of the crypto subsystem. So move the prototypes and all related definitions to a new header file under include/crypto/internal. Also, let's use the new module namespace feature to move the symbol exports into a new namespace CRYPTO_INTERNAL. Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: aesni - implement support for cts(cbc(aes))Ard Biesheuvel2-1/+261
Follow the same approach as the arm64 driver for implementing a version of AES-NI in CBC mode that supports ciphertext stealing. This results in a ~2x speed increase for relatively short inputs (less than 256 bytes), which is relevant given that AES-CBC with ciphertext stealing is used for filename encryption in the fscrypt layer. For larger inputs, the speedup is still significant (~25% on decryption, ~6% on encryption) Tested-by: Eric Biggers <ebiggers@google.com> # x86_64 Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2021-01-03crypto: arm/chacha-neon - add missing counter incrementArd Biesheuvel1-0/+1
Commit 86cd97ec4b943af3 ("crypto: arm/chacha-neon - optimize for non-block size multiples") refactored the chacha block handling in the glue code in a way that may result in the counter increment to be omitted when calling chacha_block_xor_neon() to process a full block. This violates the skcipher API, which requires that the output IV is suitable for handling more input as long as the preceding input has been presented in round multiples of the block size. Also, the same code is exposed via the chacha library interface whose callers may actually rely on this increment to occur even for final blocks that are smaller than the chacha block size. So increment the counter after calling chacha_block_xor_neon(). Fixes: 86cd97ec4b943af3 ("crypto: arm/chacha-neon - optimize for non-block size multiples") Reported-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-12-24Merge tag 'riscv-for-linus-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linuxLinus Torvalds1-1/+1
Pull RISC-V fix from Palmer Dabbelt "Avoid trying to initialize memory regions outside the usable range" * tag 'riscv-for-linus-5.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: RISC-V: Fix usage of memblock_enforce_memory_limit
2020-12-24Merge tag 'powerpc-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linuxLinus Torvalds7-15/+29
Pull powerpc fixes from Michael Ellerman: - Four commits fixing various things in the new C VDSO code - One fix for a 32-bit VMAP stack bug - Two minor build fixes Thanks to Cédric Le Goater, Christophe Leroy, and Will Springer. * tag 'powerpc-5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/32: Fix vmap stack - Properly set r1 before activating MMU on syscall too powerpc/vdso: Fix DOTSYM for 32-bit LE VDSO powerpc/vdso: Don't pass 64-bit ABI cflags to 32-bit VDSO powerpc/vdso: Block R_PPC_REL24 relocations powerpc/smp: Add __init to init_big_cores() powerpc/time: Force inlining of get_tb() powerpc/boot: Fix build of dts/fsl
2020-12-24Merge tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds6-9/+7
Pull irq updates from Thomas Gleixner: "This is the second attempt after the first one failed miserably and got zapped to unblock the rest of the interrupt related patches. A treewide cleanup of interrupt descriptor (ab)use with all sorts of racy accesses, inefficient and disfunctional code. The goal is to remove the export of irq_to_desc() to prevent these things from creeping up again" * tag 'irq-core-2020-12-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits) genirq: Restrict export of irq_to_desc() xen/events: Implement irq distribution xen/events: Reduce irq_info:: Spurious_cnt storage size xen/events: Only force affinity mask for percpu interrupts xen/events: Use immediate affinity setting xen/events: Remove disfunct affinity spreading xen/events: Remove unused bind_evtchn_to_irq_lateeoi() net/mlx5: Use effective interrupt affinity net/mlx5: Replace irq_to_desc() abuse net/mlx4: Use effective interrupt affinity net/mlx4: Replace irq_to_desc() abuse PCI: mobiveil: Use irq_data_get_irq_chip_data() PCI: xilinx-nwl: Use irq_data_get_irq_chip_data() NTB/msi: Use irq_has_action() mfd: ab8500-debugfs: Remove the racy fiddling with irq_desc pinctrl: nomadik: Use irq_has_action() drm/i915/pmu: Replace open coded kstat_irqs() copy drm/i915/lpe_audio: Remove pointless irq_to_desc() usage s390/irq: Use irq_desc_kstat_cpu() in show_msi_interrupt() parisc/irq: Use irq_desc_kstat_cpu() in show_interrupts() ...
2020-12-24Merge tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds9-152/+30
Pull EFI updates from Borislav Petkov: "These got delayed due to a last minute ia64 build issue which got fixed in the meantime. EFI updates collected by Ard Biesheuvel: - Don't move BSS section around pointlessly in the x86 decompressor - Refactor helper for discovering the EFI secure boot mode - Wire up EFI secure boot to IMA for arm64 - Some fixes for the capsule loader - Expose the RT_PROP table via the EFI test module - Relax DT and kernel placement restrictions on ARM with a few followup fixes: - fix the build breakage on IA64 caused by recent capsule loader changes - suppress a type mismatch build warning in the expansion of EFI_PHYS_ALIGN on ARM" * tag 'efi_updates_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: efi: arm: force use of unsigned type for EFI_PHYS_ALIGN efi: ia64: disable the capsule loader efi: stub: get rid of efi_get_max_fdt_addr() efi/efi_test: read RuntimeServicesSupported efi: arm: reduce minimum alignment of uncompressed kernel efi: capsule: clean scatter-gather entries from the D-cache efi: capsule: use atomic kmap for transient sglist mappings efi: x86/xen: switch to efi_get_secureboot_mode helper arm64/ima: add ima_arch support ima: generalize x86/EFI arch glue for other EFI architectures efi: generalize efi_get_secureboot efi/libstub: EFI_GENERIC_STUB_INITRD_CMDLINE_LOADER should not default to yes efi/x86: Only copy the compressed kernel image in efi_relocate_kernel() efi/libstub/x86: simplify efi_is_native()
2020-12-22Merge tag 'kbuild-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuildLinus Torvalds1-1/+1
Pull Kbuild updates from Masahiro Yamada: - Use /usr/bin/env for shebang lines in scripts - Remove useless -Wnested-externs warning flag - Update documents - Refactor log handling in modpost - Stop building modules without MODULE_LICENSE() tag - Make the insane combination of 'static' and EXPORT_SYMBOL an error - Improve genksyms to handle _Static_assert() * tag 'kbuild-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: Documentation/kbuild: Document platform dependency practises Documentation/kbuild: Document COMPILE_TEST dependencies genksyms: Ignore module scoped _Static_assert() modpost: turn static exports into error modpost: turn section mismatches to error from fatal() modpost: change license incompatibility to error() from fatal() modpost: turn missing MODULE_LICENSE() into error modpost: refactor error handling and clarify error/fatal difference modpost: rename merror() to error() kbuild: don't hardcode depmod path kbuild: doc: document subdir-y syntax kbuild: doc: clarify the difference between extra-y and always-y kbuild: doc: split if_changed explanation to a separate section kbuild: doc: merge 'Special Rules' and 'Custom kbuild commands' sections kbuild: doc: fix 'List directories to visit when descending' section kbuild: doc: replace arch/$(ARCH)/ with arch/$(SRCARCH)/ kbuild: doc: update the description about kbuild Makefiles Makefile.extrawarn: remove -Wnested-externs warning tweewide: Fix most Shebang lines
2020-12-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds36-51/+466
Merge KASAN updates from Andrew Morton. This adds a new hardware tag-based mode to KASAN. The new mode is similar to the existing software tag-based KASAN, but relies on arm64 Memory Tagging Extension (MTE) to perform memory and pointer tagging (instead of shadow memory and compiler instrumentation). By Andrey Konovalov and Vincenzo Frascino. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (60 commits) kasan: update documentation kasan, mm: allow cache merging with no metadata kasan: sanitize objects when metadata doesn't fit kasan: clarify comment in __kasan_kfree_large kasan: simplify assign_tag and set_tag calls kasan: don't round_up too much kasan, mm: rename kasan_poison_kfree kasan, mm: check kasan_enabled in annotations kasan: add and integrate kasan boot parameters kasan: inline (un)poison_range and check_invalid_free kasan: open-code kasan_unpoison_slab kasan: inline random_tag for HW_TAGS kasan: inline kasan_reset_tag for tag-based modes kasan: remove __kasan_unpoison_stack kasan: allow VMAP_STACK for HW_TAGS mode kasan, arm64: unpoison stack only with CONFIG_KASAN_STACK kasan: introduce set_alloc_info kasan: rename get_alloc/free_info kasan: simplify quarantine_put call site kselftest/arm64: check GCR_EL1 after context switch ...
2020-12-22Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linuxLinus Torvalds50-473/+1256
Pull ARM updates from Russell King: - Rework phys/virt translation - Add KASan support - Move DT out of linear map region - Use more PC-relative addressing in assembly - Remove FP emulation handling while in kernel mode - Link with '-z norelro' - remove old check for GCC <= 4.2 in ARM unwinder code - disable big endian if using clang's linker * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm: (46 commits) ARM: 9027/1: head.S: explicitly map DT even if it lives in the first physical section ARM: 9038/1: Link with '-z norelro' ARM: 9037/1: uncompress: Add OF_DT_MAGIC macro ARM: 9036/1: uncompress: Fix dbgadtb size parameter name ARM: 9035/1: uncompress: Add be32tocpu macro ARM: 9033/1: arm/smp: Drop the macro S(x,s) ARM: 9032/1: arm/mm: Convert PUD level pgtable helper macros into functions ARM: 9031/1: hyp-stub: remove unused .L__boot_cpu_mode_offset symbol ARM: 9044/1: vfp: use undef hook for VFP support detection ARM: 9034/1: __div64_32(): straighten up inline asm constraints ARM: 9030/1: entry: omit FP emulation for UND exceptions taken in kernel mode ARM: 9029/1: Make iwmmxt.S support Clang's integrated assembler ARM: 9028/1: disable KASAN in call stack capturing routines ARM: 9026/1: unwind: remove old check for GCC <= 4.2 ARM: 9025/1: Kconfig: CPU_BIG_ENDIAN depends on !LD_IS_LLD ARM: 9024/1: Drop useless cast of "u64" to "long long" ARM: 9023/1: Spelling s/mmeory/memory/ ARM: 9022/1: Change arch/arm/lib/mem*.S to use WEAK instead of .weak ARM: kvm: replace open coded VA->PA calculations with adr_l call ARM: head.S: use PC relative insn sequence to calculate PHYS_OFFSET ...
2020-12-22Merge tag 'dma-mapping-5.11' of git://git.infradead.org/users/hch/dma-mappingLinus Torvalds3-12/+111
Pull dma-mapping updates from Christoph Hellwig: - support for a partial IOMMU bypass (Alexey Kardashevskiy) - add a DMA API benchmark (Barry Song) - misc fixes (Tiezhu Yang, tangjianqiang) * tag 'dma-mapping-5.11' of git://git.infradead.org/users/hch/dma-mapping: selftests/dma: add test application for DMA_MAP_BENCHMARK dma-mapping: add benchmark support for streaming DMA APIs dma-contiguous: fix a typo error in a comment dma-pool: no need to check return value of debugfs_create functions powerpc/dma: Fallback to dma_ops when persistent memory present dma-mapping: Allow mixing bypass and mapped DMA operation
2020-12-22x86/split-lock: Avoid returning with interrupts enabledAndi Kleen1-1/+2
When a split lock is detected always make sure to disable interrupts before returning from the trap handler. The kernel exit code assumes that all exits run with interrupts disabled, otherwise the SWAPGS sequence can race against interrupts and cause recursing page faults and later panics. The problem will only happen on CPUs with split lock disable functionality, so Icelake Server, Tiger Lake, Snow Ridge, Jacobsville. Fixes: ca4c6a9858c2 ("x86/traps: Make interrupt enable/disable symmetric in C code") Fixes: bce9b042ec73 ("x86/traps: Disable interrupts in exc_aligment_check()") # v5.8+ Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-12-22kasan: allow VMAP_STACK for HW_TAGS modeAndrey Konovalov1-4/+4
Even though hardware tag-based mode currently doesn't support checking vmalloc allocations, it doesn't use shadow memory and works with VMAP_STACK as is. Change VMAP_STACK definition accordingly. Link: https://lkml.kernel.org/r/ecdb2a1658ebd88eb276dee2493518ac0e82de41.1606162397.git.andreyknvl@google.com Link: https://linux-review.googlesource.com/id/I3552cbc12321dec82cd7372676e9372a2eb452ac Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Branislav Rankov <Branislav.Rankov@arm.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>