path: root/src/crypto
Commit message | Author | Age | Files | Lines
* crypto: curve25519-x86_64: use in/out register constraints more precisely | Jason A. Donenfeld | 2021-12-13 | 1 | -293/+504

  Rather than passing all variables as modified outputs, pass the ones
  that are only read as input operands. This helps with old gcc versions
  when alternatives are additionally used, and lets gcc's codegen be a
  little bit more efficient. This also syncs up with the latest
  Vale/EverCrypt output.

  This also forward-ports 3c9f3b6 ("crypto: curve25519-x86_64: solve
  register constraints with reserved registers").

  Cc: Aymeric Fromherz <aymeric.fromherz@inria.fr>
  Cc: Mathias Krause <minipli@grsecurity.net>
  Link: https://lore.kernel.org/wireguard/1554725710.1290070.1639240504281.JavaMail.zimbra@inria.fr/
  Link: https://github.com/project-everest/hacl-star/pull/501
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
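  A minimal sketch of the constraint style this moves to, with
  illustrative operand names rather than the file's actual variables:
  pointers the assembly only dereferences stay in the input list, and
  only genuinely modified operands remain outputs.

      #include <linux/types.h>

      /* Hypothetical fmul-like wrapper: 'tmp' is advanced by the
       * sequence, so it stays an in/out ("+&r") operand, while 'out',
       * 'f1' and 'f2' are only ever read and can be plain "r" inputs. */
      static inline void fmul_sketch(u64 *out, const u64 *f1,
                                     const u64 *f2, u64 *tmp)
      {
              asm volatile(
                      /* ... multiply and reduce sequence elided ... */
                      ""
                      : "+&r"(tmp)
                      : "r"(out), "r"(f1), "r"(f2)
                      : "%rax", "%rcx", "%rdx", "%r8", "%r9",
                        "memory", "cc");
      }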
* crypto: curve25519-x86_64: solve register constraints with reserved registers | Mathias Krause | 2021-12-06 | 1 | -4/+4

  The register constraints for the inline assembly in fsqr() and fsqr2()
  are pretty tight on what the compiler may assign to the remaining
  three register variables. The clobber list only allows the following
  to be used: RDI, RSI, RBP and R12. With RAP reserving R12 and a kernel
  built with CONFIG_FRAME_POINTER=y claiming RBP, only two registers are
  left, so the compiler rightfully complains about impossible
  constraints.

  Provide alternatives that allow a memory reference for 'out' to solve
  the allocation constraint dilemma for this configuration.

  Also make 'out' an input-only operand, as it is only used as such.
  This not only allows gcc to optimize its usage further, but also works
  around older gcc versions that apparently fail to handle multiple
  alternatives correctly, as in failing to initialize the 'out' operand
  with its input value.

  Signed-off-by: Mathias Krause <minipli@grsecurity.net>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
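  The shape of the fix, as a rough sketch with assumed operand names
  (not the actual fsqr() body): 'out' may live in a register or a stack
  slot, and is listed as an input since the pointer itself is never
  written.

      #include <linux/types.h>

      static inline void fsqr_sketch(u64 *out, u64 *f, u64 *tmp)
      {
              /* "rm" lets gcc pick a memory operand for 'out' when
               * R12 (RAP) and RBP (frame pointer) are both off
               * limits, relieving the register pressure. */
              asm volatile(
                      /* ... squaring sequence elided ... */
                      ""
                      : "+&r"(f), "+&r"(tmp)
                      : "rm"(out)
                      : "%rax", "%rbx", "%rcx", "%rdx", "%r8",
                        "memory", "cc");
      }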
* compat: remove unused version.h headers | Jason A. Donenfeld | 2021-02-07 | 1 | -1/+0

  We don't need this in all files, and it just complicates things.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: do not export symbols | Jason A. Donenfeld | 2020-04-14 | 5 | -19/+0

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* curve25519-x86_64: avoid use of r12 | Jason A. Donenfeld | 2020-02-19 | 1 | -55/+55

  This causes problems with RAP and KERNEXEC for PaX, as r12 is a
  reserved register. It also leads to a more compact instruction
  encoding, saving about 100 cycles.

  Suggested-by: PaX Team <pageexec@freemail.hu>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20poly1305: defensively protect against large inputs | Jason A. Donenfeld | 2020-02-06 | 1 | -1/+3

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* curve25519: x86_64: replace with formally verified implementation | Jason A. Donenfeld | 2020-01-21 | 3 | -2308/+1300

  This comes from INRIA's HACL*/Vale. It implements the same algorithm
  and implementation strategy as the code it replaces, only this code
  has been formally verified, sans the base point multiplication, which
  uses code similar to prior, only it uses the formally verified field
  arithmetic alongside reproducible ladder generation steps. This
  doesn't have a pure-bmi2 version, which means Haswell no longer
  benefits, but the increased (doubled) code complexity is not worth it
  for a single generation of chips that's already old.

  Performance-wise, this is around 1% slower on older
  microarchitectures, and slightly faster on newer microarchitectures,
  mainly 10nm ones or backports of 10nm to 14nm. This implementation is
  "everest" below:

  Xeon E5-2680 v4 (Broadwell)
     armfazh: 133340 cycles per call
     everest: 133436 cycles per call

  Xeon Gold 5120 (Sky Lake Server)
     armfazh: 112636 cycles per call
     everest: 113906 cycles per call

  Core i5-6300U (Sky Lake Client)
     armfazh: 116810 cycles per call
     everest: 117916 cycles per call

  Core i7-7600U (Kaby Lake)
     armfazh: 119523 cycles per call
     everest: 119040 cycles per call

  Core i7-8750H (Coffee Lake)
     armfazh: 113914 cycles per call
     everest: 113650 cycles per call

  Core i9-9880H (Coffee Lake Refresh)
     armfazh: 112616 cycles per call
     everest: 114082 cycles per call

  Core i3-8121U (Cannon Lake)
     armfazh: 113202 cycles per call
     everest: 111382 cycles per call

  Core i7-8265U (Whiskey Lake)
     armfazh: 127307 cycles per call
     everest: 127697 cycles per call

  Core i7-8550U (Kaby Lake Refresh)
     armfazh: 127522 cycles per call
     everest: 127083 cycles per call

  Xeon Platinum 8275CL (Cascade Lake)
     armfazh: 114380 cycles per call
     everest: 114656 cycles per call

  Achieving these kinds of results with formally verified code is quite
  remarkable, especially considering that performance is favorable for
  newer chips.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: fix up spelling | Josh Soref | 2019-12-12 | 1 | -1/+1

  Signed-off-by: Josh Soref <jsoref@gmail.com>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20poly1305: double check the sgmiter logic with test | Jason A. Donenfeld | 2019-12-06 | 1 | -8/+59

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: use new assembler macros for 5.5 | Jason A. Donenfeld | 2019-12-05 | 5 | -14/+14

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20poly1305: port to sgmitter for 5.5 | Jason A. Donenfeld | 2019-12-05 | 3 | -114/+143

  I'm not totally comfortable with these changes yet, and it'll require
  some more scrutiny. But it's a start.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
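  For context, the sg_miter walking pattern this ports to looks roughly
  like the sketch below; the real code threads chacha20poly1305 state
  through the loop, and the function name here is illustrative.

      #include <linux/scatterlist.h>

      /* Walk a scatterlist in mapped chunks; sg_miter takes care of
       * the page mapping and unmapping previously done by hand. */
      static void walk_sg_sketch(struct scatterlist *sg, unsigned int nents)
      {
              struct sg_mapping_iter miter;

              sg_miter_start(&miter, sg, nents,
                             SG_MITER_TO_SG | SG_MITER_ATOMIC);
              while (sg_miter_next(&miter)) {
                      /* miter.addr points at miter.length mapped
                       * bytes; encrypt or decrypt them in place. */
              }
              sg_miter_stop(&miter);
      }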
* blake2s: spacing | Jason A. Donenfeld | 2019-06-03 | 2 | -123/+123

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* curve25519: not all linkers support bmi2 and adx | Jason A. Donenfeld | 2019-06-02 | 2 | -6/+48

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* blake2s: add ssse3 to nobs | Jason A. Donenfeld | 2019-05-31 | 1 | -1/+2

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* blake2s: do not use xgetbv for ssse3 detection | Jason A. Donenfeld | 2019-05-31 | 1 | -3/+1

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
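  The reasoning, sketched below with illustrative names: xgetbv reads
  XCR0, which only matters for the YMM and wider state that AVX code
  touches; SSSE3 uses only XMM state, which the kernel always enables,
  so the CPUID feature bit alone suffices.

      #include <asm/cpufeature.h>

      static bool blake2s_use_ssse3 __read_mostly;

      static void __init blake2s_fpu_init_sketch(void)
      {
              /* No xgetbv()/XCR0 check needed for plain SSSE3. */
              blake2s_use_ssse3 = boot_cpu_has(X86_FEATURE_SSSE3);
      }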
* zinc: update copyright | Jason A. Donenfeld | 2019-05-29 | 2 | -2/+2

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* blake2s: shorten ssse3 loop | Samuel Neves | 2019-05-29 | 1 | -857/+66

  This (mostly) preserves the performance (as measured on Haswell and
  *lake) of the last commit, but it drastically reduces code size.

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* blake2s,chacha: latency tweak | Samuel Neves | 2019-05-29 | 5 | -618/+982

  In every odd-numbered round, instead of operating over the state

      x00 x01 x02 x03
      x05 x06 x07 x04
      x10 x11 x08 x09
      x15 x12 x13 x14

  we operate over the rotated state

      x03 x00 x01 x02
      x04 x05 x06 x07
      x09 x10 x11 x08
      x14 x15 x12 x13

  The advantage here is that this requires no changes to the
  'x04 x05 x06 x07' row, which is in the critical path. This results in
  a noticeable latency improvement of roughly R cycles, for R diagonal
  rounds in the primitive.

  In the case of BLAKE2s, which I also moved from requiring AVX to only
  requiring SSSE3, we save approximately 30 cycles per compression
  function call on Haswell and Skylake. In other words, this is an
  improvement of ~0.6 cpb.

  This idea was pointed out to me by Shunsuke Shimizu, though it appears
  to have been around for longer.

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
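  A userspace SSE sketch of the tweaked diagonalization (the in-kernel
  code is generated assembly, not intrinsics; function name here is
  illustrative): rows a, c and d are rotated into the new layout, while
  row b, the critical-path row, is left untouched.

      #include <immintrin.h>

      /* Lane 0 is the leftmost word in the matrices above. */
      static inline void diagonalize_sketch(__m128i *a, __m128i *c,
                                            __m128i *d)
      {
              *a = _mm_shuffle_epi32(*a, _MM_SHUFFLE(2, 1, 0, 3)); /* x03 x00 x01 x02 */
              *c = _mm_shuffle_epi32(*c, _MM_SHUFFLE(0, 3, 2, 1)); /* x09 x10 x11 x08 */
              *d = _mm_shuffle_epi32(*d, _MM_SHUFFLE(1, 0, 3, 2)); /* x14 x15 x12 x13 */
      }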
* zinc: arm64: use cpu_get_elf_hwcap accessor for 5.2 | Jason A. Donenfeld | 2019-05-29 | 2 | -2/+2

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* kbuild: account for recent upstream changes | Jason A. Donenfeld | 2019-05-29 | 1 | -1/+1

  Apparently cdd750bfb1f76fe9be8cfb53cbe77b2e811081ab changed things, so
  we fall back onto this hack.

  Reported-by: Alex Xu <alex@alxu.ca>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* blake2s: remove outlen parameter from final | Jason A. Donenfeld | 2019-03-27 | 2 | -8/+7

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* blake2s: simplify | Samuel Neves | 2019-03-27 | 2 | -40/+12

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: name enums | Jason A. Donenfeld | 2019-02-04 | 1 | -2/+2

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* noise: store clamped key instead of raw key | Jason A. Donenfeld | 2019-02-03 | 5 | -14/+13

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20poly1305: permit unaligned strides on certain platforms | Jason A. Donenfeld | 2019-02-03 | 1 | -18/+14

  The map allocations required to fix this are mostly slower than
  unaligned paths.

  Reported-by: Louis Sautier <sbraz@gentoo.org>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: normalize -> clamp | Jason A. Donenfeld | 2019-01-23 | 4 | -17/+10

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* global: update copyright | Jason A. Donenfeld | 2019-01-07 | 37 | -37/+37

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* makefile: use immediate expansion and use correct template patterns | Jason A. Donenfeld | 2018-12-18 | 1 | -6/+6

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: do not define unused asm function | Jason A. Donenfeld | 2018-12-07 | 1 | -4/+2

  This causes RAP to be unhappy, and we're not using it anyway.

  Reported-by: Ivan J. <parazyd@dyne.org>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20,poly1305: simplify perlasm fanciness | Jason A. Donenfeld | 2018-12-07 | 3 | -75/+69

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20,poly1305: do not use xlate | Jason A. Donenfeld | 2018-11-19 | 3 | -1496/+73

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* poly1305: make frame pointers for auxiliary calls | Samuel Neves | 2018-11-17 | 1 | -31/+43

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: better path resolution and more specific generated .S | Jason A. Donenfeld | 2018-11-16 | 1 | -6/+5

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20,poly1305: don't do compiler testing in generator and remove xor helper | Jason A. Donenfeld | 2018-11-15 | 2 | -30/+39

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* crypto: resolve target prefix on buggy kernels | Jason A. Donenfeld | 2018-11-15 | 1 | -1/+6

  We also move to .SECONDARY, since older kernels don't use targets like
  that.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* poly1305: cleanup leftover debugging changes | Jason A. Donenfeld | 2018-11-15 | 1 | -3/+3

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* poly1305: only export neon symbols when in use | Jason A. Donenfeld | 2018-11-15 | 1 | -2/+6

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20,poly1305: fix up for win64 | Samuel Neves | 2018-11-15 | 2 | -27/+29

  These don't help us, but it is important to keep this working for when
  it's re-added to cryptogams.

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* perlasm: avoid rep ret | Jason A. Donenfeld | 2018-11-15 | 1 | -1/+1

  The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret". We
  replace this with a plain "ret". "rep ret" was meant to help with AMD
  K8 chips, cf. http://repzret.org/p/repzret. It makes no sense to
  continue to use this kludge for code that won't even run on ancient
  AMD chips.

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* poly1305: specialize to wireguard | Jason A. Donenfeld | 2018-11-15 | 1 | -11/+20

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: specialize to wireguard | Jason A. Donenfeld | 2018-11-15 | 2 | -20/+38

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* perlasm: cleanup whitespace | Jason A. Donenfeld | 2018-11-15 | 1 | -5/+5

  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* poly1305: adjust to kernel | Samuel Neves | 2018-11-15 | 1 | -220/+291

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: cleaner function declarations | Samuel Neves | 2018-11-14 | 1 | -23/+23

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: normalize names | Samuel Neves | 2018-11-14 | 1 | -71/+71

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: fixup win64 stack offsets | Samuel Neves | 2018-11-14 | 1 | -129/+129

  We don't need to do this for kernel purposes, but it's polite to leave
  things unbroken.

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: simplify stack unwinding on ChaCha20_ctr32 | Samuel Neves | 2018-11-14 | 1 | -10/+8

  objtool did not quite understand the stack arithmetic employed here.

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: use DRAP idiom | Samuel Neves | 2018-11-14 | 1 | -236/+235

  This effectively means swapping the usage of %r9 and %r10 globally.

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: add hchacha_ssse3 | Samuel Neves | 2018-11-14 | 1 | -0/+39

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
* chacha20: begin adapting to kernel setting | Samuel Neves | 2018-11-14 | 2 | -68/+116

  Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
  Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>