aboutsummaryrefslogtreecommitdiffstats
path: root/arch/powerpc/lib/memcpy_power7.S (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 156Thomas Gleixner1-13/+1
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 59 temple place suite 330 boston ma 02111 1307 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 1334 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-08selftests/powerpc/64: Test all paths through copy routinesPaul Mackerras1-11/+11
The hand-coded assembler 64-bit copy routines include feature sections that select one code path or another depending on which CPU we are executing on. The self-tests for these copy routines end up testing just one path. This adds a mechanism for selecting any desired code path at compile time, and makes 2 or 3 versions of each test, each using a different code path, so as to cover all the possible paths. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> [mpe: Add -mcpu=power4 to CFLAGS for older compilers] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-07-24powerpc/64: enhance memcmp() with VMX instruction for long bytes comparisionSimon Guo1-3/+3
This patch add VMX primitives to do memcmp() in case the compare size is equal or greater than 4K bytes. KSM feature can benefit from this. Test result with following test program(replace the "^>" with ""): ------ ># cat tools/testing/selftests/powerpc/stringloops/memcmp.c >#include <malloc.h> >#include <stdlib.h> >#include <string.h> >#include <time.h> >#include "utils.h" >#define SIZE (1024 * 1024 * 900) >#define ITERATIONS 40 int test_memcmp(const void *s1, const void *s2, size_t n); static int testcase(void) { char *s1; char *s2; unsigned long i; s1 = memalign(128, SIZE); if (!s1) { perror("memalign"); exit(1); } s2 = memalign(128, SIZE); if (!s2) { perror("memalign"); exit(1); } for (i = 0; i < SIZE; i++) { s1[i] = i & 0xff; s2[i] = i & 0xff; } for (i = 0; i < ITERATIONS; i++) { int ret = test_memcmp(s1, s2, SIZE); if (ret) { printf("return %d at[%ld]! should have returned zero\n", ret, i); abort(); } } return 0; } int main(void) { return test_harness(testcase, "memcmp"); } ------ Without this patch (but with the first patch "powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()." in the series): 4.726728762 seconds time elapsed ( +- 3.54%) With VMX patch: 4.234335473 seconds time elapsed ( +- 2.63%) There is ~+10% improvement. Testing with unaligned and different offset version (make s1 and s2 shift random offset within 16 bytes) can archieve higher improvement than 10%.. Signed-off-by: Simon Guo <wei.guo.simon@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2018-04-01powerpc/64s: Set assembler machine type to POWER4Nicholas Piggin1-3/+0
Rather than override the machine type in .S code (which can hide wrong or ambiguous code generation for the target), set the type to power4 for all assembly. This also means we need to be careful not to build power4-only code when we're not building for Book3S, such as the "power7" versions of copyuser/page/memcpy. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Fix Book3E build, don't build the "power7" variants for non-Book3S] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-10powerpc: Fix invalid use of register expressionsAndreas Schwab1-33/+33
binutils >= 2.26 now warns about misuse of register expressions in assembler operands that are actually literals, for example: arch/powerpc/kernel/entry_64.S:535: Warning: invalid register expression In practice these are almost all uses of r0 that should just be a literal 0. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> [mpe: Mention r0 is almost always the culprit, fold in purgatory change] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-16powerpc: Change vrX register defines to vX to match gcc and glibcAnton Blanchard1-113/+113
As our various loops (copy, string, crypto etc) get more complicated, we want to share implementations between userspace (eg glibc) and the kernel. We also want to write userspace test harnesses to put in tools/testing/selftest. One gratuitous difference between userspace and the kernel is the VMX register definitions - the kernel uses vrX whereas both gcc and glibc use vX. Change the kernel to match userspace. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-10-29powerpc: Fix comment typos 'CONFiG_ALTIVEC'Paul Bolle1-1/+1
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-04-23powerpc: Fix unsafe accesses to parameter area in ELFv2Ulrich Weigand1-10/+10
Some of the assembler files in lib/ make use of the fact that in the ELFv1 ABI, the caller guarantees to provide stack space to save the parameter registers r3 ... r10. This guarantee is no longer present in ELFv2 for functions that have no variable argument list and no more than 8 arguments. Change the affected routines to temporarily store registers in the red zone and/or the top of their own stack frame (in the space provided to save r31 .. r29, which is actually not used in these routines). In opal_query_takeover, simply always allocate a stack frame; the routine is not performance critical. Signed-off-by: Ulrich Weigand <ulrich.weigand@de.ibm.com> Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23powerpc: Fix ABIv2 issues with stack offsets in assembly codeAnton Blanchard1-10/+10
Fix STK_PARAM and use it instead of hardcoding ABIv1 offsets. Signed-off-by: Anton Blanchard <anton@samba.org>
2014-04-23powerpc: No need to use dot symbols when branching to a functionAnton Blanchard1-3/+3
binutils is smart enough to know that a branch to a function descriptor is actually a branch to the functions text address. Alan tells me that binutils has been doing this for 9 years. Signed-off-by: Anton Blanchard <anton@samba.org>
2013-10-11powerpc: Fix endian issues in VMX copy loopsAnton Blanchard1-23/+32
Fix the permute loops for little endian. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-10-04powerpc: Fix VMX fix for memcpy caseNishanth Aravamudan1-2/+2
In 2fae7cdb60240e2e2d9b378afbf6d9fcce8a3890 ("powerpc: Fix VMX in interrupt check in POWER7 copy loops"), Anton inadvertently introduced a regression for memcpy on POWER7 machines. copyuser and memcpy diverge slightly in their use of cr1 (copyuser doesn't use it, but memcpy does) and you end up clobbering that register with your fix. That results in (taken from an FC18 kernel): [ 18.824604] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052f40 [ 18.824618] Oops: Unrecoverable VMX/Altivec Unavailable Exception, sig: 6 [#1] [ 18.824623] SMP NR_CPUS=1024 NUMA pSeries [ 18.824633] Modules linked in: tg3(+) be2net(+) cxgb4(+) ipr(+) sunrpc xts lrw gf128mul dm_crypt dm_round_robin dm_multipath linear raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua squashfs cramfs [ 18.824705] NIP: c000000000052f40 LR: c00000000020b874 CTR: 0000000000000512 [ 18.824709] REGS: c000001f1fef7790 TRAP: 0f20 Not tainted (3.6.0-0.rc6.git0.2.fc18.ppc64) [ 18.824713] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI> CR: 4802802e XER: 20000010 [ 18.824726] SOFTE: 0 [ 18.824728] CFAR: 0000000000000f20 [ 18.824731] TASK = c000000fa7128400[0] 'swapper/24' THREAD: c000000fa7480000 CPU: 24 GPR00: 00000000ffffffc0 c000001f1fef7a10 c00000000164edc0 c000000f9b9a8120 GPR04: c000000f9b9a8124 0000000000001438 0000000000000060 03ffffff064657ee GPR08: 0000000080000000 0000000000000010 0000000000000020 0000000000000030 GPR12: 0000000028028022 c00000000ff25400 0000000000000001 0000000000000000 GPR16: 0000000000000000 7fffffffffffffff c0000000016b2180 c00000000156a500 GPR20: c000000f968c7a90 c0000000131c31d8 c000001f1fef4000 c000000001561d00 GPR24: 000000000000000a 0000000000000000 0000000000000001 0000000000000012 GPR28: c000000fa5c04f80 00000000000008bc c0000000015c0a28 000000000000022e [ 18.824792] NIP [c000000000052f40] .memcpy_power7+0x5a0/0x7c4 [ 18.824797] LR [c00000000020b874] .pcpu_free_area+0x174/0x2d0 [ 18.824800] Call Trace: [ 18.824803] [c000001f1fef7a10] [c000000000052c14] .memcpy_power7+0x274/0x7c4 (unreliable) [ 18.824809] [c000001f1fef7b10] [c00000000020b874] .pcpu_free_area+0x174/0x2d0 [ 18.824813] [c000001f1fef7bb0] [c00000000020ba88] .free_percpu+0xb8/0x1b0 [ 18.824819] [c000001f1fef7c50] [c00000000043d144] .throtl_pd_exit+0x94/0xd0 [ 18.824824] [c000001f1fef7cf0] [c00000000043acf8] .blkg_free+0x88/0xe0 [ 18.824829] [c000001f1fef7d90] [c00000000018c048] .rcu_process_callbacks+0x2e8/0x8a0 [ 18.824835] [c000001f1fef7e90] [c0000000000a8ce8] .__do_softirq+0x158/0x4d0 [ 18.824840] [c000001f1fef7f90] [c000000000025ecc] .call_do_softirq+0x14/0x24 [ 18.824845] [c000000fa7483650] [c000000000010e80] .do_softirq+0x160/0x1a0 [ 18.824850] [c000000fa74836f0] [c0000000000a94a4] .irq_exit+0xf4/0x120 [ 18.824854] [c000000fa7483780] [c000000000020c44] .timer_interrupt+0x154/0x4d0 [ 18.824859] [c000000fa7483830] [c000000000003be0] decrementer_common+0x160/0x180 [ 18.824866] --- Exception: 901 at .plpar_hcall_norets+0x84/0xd4 [ 18.824866] LR = .check_and_cede_processor+0x48/0x80 [ 18.824871] [c000000fa7483b20] [c00000000007f018] .check_and_cede_processor+0x18/0x80 (unreliable) [ 18.824877] [c000000fa7483b90] [c00000000007f104] .dedicated_cede_loop+0x84/0x150 [ 18.824883] [c000000fa7483c50] [c0000000006bc030] .cpuidle_enter+0x30/0x50 [ 18.824887] [c000000fa7483cc0] [c0000000006bc9f4] .cpuidle_idle_call+0x104/0x720 [ 18.824892] [c000000fa7483d80] [c000000000070af8] .pSeries_idle+0x18/0x40 [ 18.824897] [c000000fa7483df0] [c000000000019084] .cpu_idle+0x1a4/0x380 [ 18.824902] [c000000fa7483ec0] [c0000000008a4c18] .start_secondary+0x520/0x528 [ 18.824907] [c000000fa7483f90] [c0000000000093f0] .start_secondary_prolog+0x10/0x14 [ 18.824911] Instruction dump: [ 18.824914] 38840008 90030000 90e30004 38630008 7ca62850 7cc300d0 78c7e102 7cf01120 [ 18.824923] 78c60660 39200010 39400020 39600030 <7e00200c> 7c0020ce 38840010 409f001c [ 18.824935] ---[ end trace 0bb95124affaaa45 ]--- [ 18.825046] Unrecoverable VMX/Altivec Unavailable Exception f20 at c000000000052d08 I believe the right fix is to make memcpy match usercopy and not use cr1. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> CC: <stable@kernel.org> [v3.6]
2012-08-24powerpc: Fix VMX in interrupt check in POWER7 copy loopsAnton Blanchard1-2/+2
The enhanced prefetch hint patches corrupt the condition register that was used to check if we are in interrupt. Fix this by using cr1. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-10powerpc: Merge STK_REG/PARAM/FRAMESIZEMichael Neuling1-3/+0
Merge the defines of STACKFRAMESIZE, STK_REG, STK_PARAM from different places. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-10powerpc: Fix usage of register macros getting ready for %r0 changeMichael Neuling1-30/+30
Anything that uses a constructed instruction (ie. from ppc-opcode.h), need to use the new R0 macro, as %r0 is not going to work. Also convert usages of macros where we are just determining an offset (usually for a load/store), like: std r14,STK_REG(r14)(r1) Can't use STK_REG(r14) as %r14 doesn't work in the STK_REG macro since it's just calculating an offset. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-03powerpc: POWER7 optimised memcpy using VMX and enhanced prefetchAnton Blanchard1-0/+650
Implement a POWER7 optimised memcpy using VMX and enhanced prefetch instructions. This is a copy of the POWER7 optimised copy_to_user/copy_from_user loop. Detailed implementation and performance details can be found in commit a66086b8197d (powerpc: POWER7 optimised copy_to_user/copy_from_user using VMX). I noticed memcpy issues when profiling a RAID6 workload: .memcpy .async_memcpy .async_copy_data .__raid_run_ops .handle_stripe .raid5d .md_thread I created a simplified testcase by building a RAID6 array with 4 1GB ramdisks (booting with brd.rd_size=1048576): # mdadm -CR -e 1.2 /dev/md0 --level=6 -n4 /dev/ram[0-3] I then timed how long it took to write to the entire array: # dd if=/dev/zero of=/dev/md0 bs=1M Before: 892 MB/s After: 999 MB/s A 12% improvement. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>