arm64/lib: improve CRC32 performance for deep pipelines - linux-dev - Linux kernel development work

author	Ard Biesheuvel <ard.biesheuvel@linaro.org>	2018-11-27 18:42:55 +0100
committer	Will Deacon <will.deacon@arm.com>	2018-11-30 13:58:04 +0000
commit	efdb25efc7645b326cd5eb82be5feeabe167c24e (patch)
tree	fca2b4c8aa400a78212468e40211689723bd6957 /arch/arm64/Makefile
parent	arm64: ftrace: always pass instrumented pc in x0 (diff)

arm64/lib: improve CRC32 performance for deep pipelines

Improve the performance of the crc32() asm routines by getting rid of
most of the branches and small sized loads on the common path.

Instead, use a branchless code path involving overlapping 16 byte
loads to process the first (length % 32) bytes, and process the
remainder using a loop that processes 32 bytes at a time.

Tested using the following test program:

  #include <stdlib.h>

  extern void crc32_le(unsigned short, char const*, int);

  int main(void)
  {
    static const char buf[4096];

    srand(20181126);

    for (int i = 0; i < 100 * 1000 * 1000; i++)
      crc32_le(0, buf, rand() % 1024);

    return 0;
  }

On Cortex-A53 and Cortex-A57, the performance regresses but only very
slightly. On Cortex-A72 however, the performance improves from

  $ time ./crc32

  real  0m10.149s
  user  0m10.149s
  sys   0m0.000s

to

  $ time ./crc32

  real  0m7.915s
  user  0m7.915s
  sys   0m0.000s

Cc: Rui Sun <sunrui26@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
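The head/loop split described in the commit message can be illustrated with a portable C sketch. This is not the arm64 assembly from the patch. It is a rough model under stated assumptions: the names crc_step(), load64(), crc_ref(), crc_sketch() and crc_sketch_mismatches() are hypothetical, the CRC32 instructions are replaced by a bitwise software update (reflected polynomial 0xEDB88320), the (u32 crc, buffer, length) signature is assumed, inputs shorter than 16 bytes fall back to a byte loop, and the ternaries only stand in for conditional selects (a compiler is free to emit branches for them).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feed the low nbytes bytes of v (least significant first) into a reflected
 * CRC-32, bit by bit. Software stand-in for the CRC32B/H/W/X instructions. */
static uint32_t crc_step(uint32_t crc, uint64_t v, int nbytes)
{
    for (int i = 0; i < nbytes; i++, v >>= 8) {
        crc ^= (uint32_t)(v & 0xff);
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Assemble 8 bytes into a little-endian 64-bit value. */
static uint64_t load64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Reference implementation: plain byte-at-a-time loop. */
static uint32_t crc_ref(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
        crc = crc_step(crc, *p++, 1);
    return crc;
}

/* Hypothetical sketch: consume the first (len % 32) bytes using two
 * possibly overlapping 16-byte loads, then loop over 32-byte blocks. */
static uint32_t crc_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
    if (len < 16)                        /* assumption: short-input fallback */
        return crc_ref(crc, p, len);

    size_t head = len & 31;              /* length % 32 */
    size_t blocks = len >> 5;            /* number of 32-byte blocks */

    if (head) {
        size_t lo = head & 15;
        /* Loads cover [p, p+16) and [p+lo, p+lo+16); both stay in bounds. */
        uint64_t a0 = load64(p),      a1 = load64(p + 8);
        uint64_t b0 = load64(p + lo), b1 = load64(p + lo + 8);
        uint64_t v = a0;
        uint32_t t;

        /* Always compute each step, then select on a bit of lo. */
        t = crc_step(crc, v, 8);
        crc = (lo & 8) ? t : crc;
        v = (lo & 8) ? a1 : v;
        t = crc_step(crc, v, 4);
        crc = (lo & 4) ? t : crc;
        v = (lo & 4) ? v >> 32 : v;
        t = crc_step(crc, v, 2);
        crc = (lo & 2) ? t : crc;
        v = (lo & 2) ? v >> 16 : v;
        t = crc_step(crc, v, 1);
        crc = (lo & 1) ? t : crc;
        /* The second load supplies the 16 bytes that follow the first lo. */
        t = crc_step(crc_step(crc, b0, 8), b1, 8);
        crc = (head & 16) ? t : crc;
        p += head;
    }
    for (; blocks; blocks--, p += 32)    /* main loop: 32 bytes at a time */
        for (int i = 0; i < 32; i += 8)
            crc = crc_step(crc, load64(p + i), 8);
    return crc;
}

/* Count lengths 0..256 for which the sketch disagrees with the reference. */
static int crc_sketch_mismatches(void)
{
    unsigned char buf[256];
    int bad = 0;

    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = (unsigned char)(i * 37 + 11);
    /* Standard CRC-32 check value, with the usual pre/post inversion. */
    assert((uint32_t)~crc_ref(0xFFFFFFFFu, (const unsigned char *)"123456789", 9)
           == 0xCBF43926u);
    for (size_t len = 0; len <= sizeof buf; len++)
        bad += crc_sketch(0x12345678u, buf, len) != crc_ref(0x12345678u, buf, len);
    return bad;
}
```

The second load starts at offset (length % 16), so for any length of at least 16 bytes it never reads past the end of the buffer, and it picks up exactly where the 8/4/2/1-byte steps on the first load stop. That is what lets the head be handled without a chain of small loads selected by length.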

Diffstat (limited to 'arch/arm64/Makefile')

0 files changed, 0 insertions, 0 deletions

