aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/call-graph-from-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2017-07-14kmod: throttle kmod thread limitLuis R. Rodriguez2-31/+9
If we reach the limit of modprobe_limit threads running the next request_module() call will fail. The original reason for adding a kill was to do away with possible issues with in old circumstances which would create a recursive series of request_module() calls. We can do better than just be super aggressive and reject calls once we've reached the limit by simply making pending callers wait until the threshold has been reduced, and then throttling them in, one by one. This throttling enables requests over the kmod concurrent limit to be processed once a pending request completes. Only the first item queued up to wait is woken up. The assumption here is once a task is woken it will have no other option to also kick the queue to check if there are more pending tasks -- regardless of whether or not it was successful. By throttling and processing only max kmod concurrent tasks we ensure we avoid unexpected fatal request_module() calls, and we keep memory consumption on module loading to a minimum. With x86_64 qemu, with 4 cores, 4 GiB of RAM it takes the following run time to run both tests: time ./kmod.sh -t 0008 real 0m16.366s user 0m0.883s sys 0m8.916s time ./kmod.sh -t 0009 real 0m50.803s user 0m0.791s sys 0m9.852s Link: http://lkml.kernel.org/r/20170628223155.26472-4-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14kmod: add test driver to stress test the module loaderLuis R. Rodriguez7-0/+1929
This adds a new stress test driver for kmod: the kernel module loader. The new stress test driver, test_kmod, is only enabled as a module right now. It should be possible to load this as built-in and load tests early (refer to the force_init_test module parameter), however since a lot of test can get a system out of memory fast we leave this disabled for now. Using a system with 1024 MiB of RAM can *easily* get your kernel OOM fast with this test driver. The test_kmod driver exposes API knobs for us to fine tune simple request_module() and get_fs_type() calls. Since these API calls only allow each one parameter a test driver for these is rather simple. Other factors that can help out test driver though are the number of calls we issue and knowing current limitations of each. This exposes configuration as much as possible through userspace to be able to build tests directly from userspace. Since it allows multiple misc devices its will eventually (once we add a knob to let us create new devices at will) also be possible to perform more tests in parallel, provided you have enough memory. We only enable tests we know work as of right now. Demo screenshots: # tools/testing/selftests/kmod/kmod.sh kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0002_driver: OK! - loading kmod test kmod_test_0002_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0002_fs: OK! - loading kmod test kmod_test_0002_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL kmod_test_0003: OK! - loading kmod test kmod_test_0003: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0004: OK! - loading kmod test kmod_test_0004: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0005: OK! - loading kmod test kmod_test_0005: OK! - Return value: 0 (SUCCESS), expected SUCCESS kmod_test_0006: OK! - loading kmod test kmod_test_0006: OK! - Return value: 0 (SUCCESS), expected SUCCESS XXX: add test restult for 0007 Test completed You can also request for specific tests: # tools/testing/selftests/kmod/kmod.sh -t 0001 kmod_test_0001_driver: OK! - loading kmod test kmod_test_0001_driver: OK! - Return value: 256 (MODULE_NOT_FOUND), expected MODULE_NOT_FOUND kmod_test_0001_fs: OK! - loading kmod test kmod_test_0001_fs: OK! - Return value: -22 (-EINVAL), expected -EINVAL Test completed Lastly, the current available number of tests: # tools/testing/selftests/kmod/kmod.sh --help Usage: tools/testing/selftests/kmod/kmod.sh [ -t <4-number-digit> ] Valid tests: 0001-0009 0001 - Simple test - 1 thread for empty string 0002 - Simple test - 1 thread for modules/filesystems that do not exist 0003 - Simple test - 1 thread for get_fs_type() only 0004 - Simple test - 2 threads for get_fs_type() only 0005 - multithreaded tests with default setup - request_module() only 0006 - multithreaded tests with default setup - get_fs_type() only 0007 - multithreaded tests with default setup test request_module() and get_fs_type() 0008 - multithreaded - push kmod_concurrent over max_modprobes for request_module() 0009 - multithreaded - push kmod_concurrent over max_modprobes for get_fs_type() The following test cases currently fail, as such they are not currently enabled by default: # tools/testing/selftests/kmod/kmod.sh -t 0008 # tools/testing/selftests/kmod/kmod.sh -t 0009 To be sure to run them as intended please unload both of the modules: o test_module o xfs And ensure they are not loaded on your system prior to testing them. If you use these paritions for your rootfs you can change the default test driver used for get_fs_type() by exporting it into your environment. For example of other test defaults you can override refer to kmod.sh allow_user_defaults(). Behind the scenes this is how we fine tune at a test case prior to hitting a trigger to run it: cat /sys/devices/virtual/misc/test_kmod0/config echo -n "2" > /sys/devices/virtual/misc/test_kmod0/config_test_case echo -n "ext4" > /sys/devices/virtual/misc/test_kmod0/config_test_fs echo -n "80" > /sys/devices/virtual/misc/test_kmod0/config_num_threads cat /sys/devices/virtual/misc/test_kmod0/config echo -n "1" > /sys/devices/virtual/misc/test_kmod0/config_num_threads Finally to trigger: echo -n "1" > /sys/devices/virtual/misc/test_kmod0/trigger_config The kmod.sh script uses the above constructs to build different test cases. A bit of interpretation of the current failures follows, first two premises: a) When request_module() is used userspace figures out an optimized version of module order for us. Once it finds the modules it needs, as per depmod symbol dep map, it will finit_module() the respective modules which are needed for the original request_module() request. b) We have an optimization in place whereby if a kernel uses request_module() on a module already loaded we never bother userspace as the module already is loaded. This is all handled by kernel/kmod.c. A few things to consider to help identify root causes of issues: 0) kmod 19 has a broken heuristic for modules being assumed to be built-in to your kernel and will return 0 even though request_module() failed. Upgrade to a newer version of kmod. 1) A get_fs_type() call for "xfs" will request_module() for "fs-xfs", not for "xfs". The optimization in kernel described in b) fails to catch if we have a lot of consecutive get_fs_type() calls. The reason is the optimization in place does not look for aliases. This means two consecutive get_fs_type() calls will bump kmod_concurrent, whereas request_module() will not. This one explanation why test case 0009 fails at least once for get_fs_type(). 2) If a module fails to load --- for whatever reason (kmod_concurrent limit reached, file not yet present due to rootfs switch, out of memory) we have a period of time during which module request for the same name either with request_module() or get_fs_type() will *also* fail to load even if the file for the module is ready. This explains why *multiple* NULLs are possible on test 0009. 3) finit_module() consumes quite a bit of memory. 4) Filesystems typically also have more dependent modules than other modules, its important to note though that even though a get_fs_type() call does not incur additional kmod_concurrent bumps, since userspace loads dependencies it finds it needs via finit_module_fd(), it *will* take much more memory to load a module with a lot of dependencies. Because of 3) and 4) we will easily run into out of memory failures with certain tests. For instance test 0006 fails on qemu with 1024 MiB of RAM. It panics a box after reaping all userspace processes and still not having enough memory to reap. [arnd@arndb.de: add dependencies for test module] Link: http://lkml.kernel.org/r/20170630154834.3689272-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20170628223155.26472-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Cc: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14MAINTAINERS: give kmod some maintainer loveLuis R. Rodriguez1-0/+7
As suggested by Jessica, I've been actively working on kmod, so might as well reflect its maintained status. Changes are expected to go through akpm's tree. Link: http://lkml.kernel.org/r/20170628223155.26472-2-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Jessica Yu <jeyu@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14xtensa: use generic fb.hTobias Klauser2-12/+1
The arch uses a verbatim copy of the asm-generic version and does not add any own implementations to the header, so use asm-generic/fb.h instead of duplicating code. Link: http://lkml.kernel.org/r/20170517083545.2115-1-tklauser@distanz.ch Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: add /proc/<pid>/fail-nthAkinobu Mita2-1/+3
fail-nth interface is only created in /proc/self/task/<current-tid>/. This change also adds it in /proc/<pid>/. This makes shell based tool a bit simpler. $ bash -c "builtin echo 100 > /proc/self/fail-nth && exec ls /" Link: http://lkml.kernel.org/r/1491490561-10485-6-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: simplify access check for fail-nthAkinobu Mita2-17/+15
The fail-nth file is created with 0666 and the access is permitted if and only if the task is current. This file is owned by the currnet user. So we can create it with 0644 and allow the owner to write it. This enables to watch the status of task->fail_nth from another processes. [akinobu.mita@gmail.com: don't convert unsigned type value as signed int] Link: http://lkml.kernel.org/r/1492444483-9239-1-git-send-email-akinobu.mita@gmail.com [akinobu.mita@gmail.com: avoid unwanted data race to task->fail_nth] Link: http://lkml.kernel.org/r/1499962492-8931-1-git-send-email-akinobu.mita@gmail.com Link: http://lkml.kernel.org/r/1491490561-10485-5-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: make fail-nth read/write interface symmetricAkinobu Mita2-14/+13
The read interface for fail-nth looks a bit odd. Read from this file returns "NYYYY..." or "YYYYY..." (this makes me surprise when cat this file). Because there is no EOF condition. The first character indicates current->fail_nth is zero or not, and then current->fail_nth is reset to zero. Just returning task->fail_nth value is more natural to understand. Link: http://lkml.kernel.org/r/1491490561-10485-4-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: parse as natural 1-based value for fail-nth write interfaceAkinobu Mita3-10/+8
The value written to fail-nth file is parsed as 0-based. Parsing as one-based is more natural to understand and it enables to cancel the previous setup by simply writing '0'. This change also converts task->fail_nth from signed to unsigned int. Link: http://lkml.kernel.org/r/1491490561-10485-3-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14fault-inject: automatically detect the number base for fail-nth write interfaceAkinobu Mita1-1/+1
Automatically detect the number base to use when writing to fail-nth file instead of always parsing as a decimal number. Link: http://lkml.kernel.org/r/1491490561-10485-2-git-send-email-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14kernel/watchdog.c: use better pr_fmt prefixKefeng Wang1-1/+1
After commit 73ce0511c436 ("kernel/watchdog.c: move hardlockup detector to separate file"), 'NMI watchdog' is inappropriate in kernel/watchdog.c, using 'watchdog' only. Link: http://lkml.kernel.org/r/1499928642-48983-1-git-send-email-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Babu Moger <babu.moger@oracle.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14MAINTAINERS: move the befs tree to kernel.orgLuis de Bethencourt1-2/+2
Update the location of the befs git tree and my email address. Link: http://lkml.kernel.org/r/20170709110012.2991-1-luisbg@kernel.org Signed-off-by: Luis de Bethencourt <luisbg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14lib/atomic64_test.c: add a test that atomic64_inc_not_zero() returns an intMichael Ellerman1-0/+7
atomic64_inc_not_zero() returns a "truth value" which in C is traditionally an int. That means callers are likely to expect the result will fit in an int. If an implementation returns a "true" value which does not fit in an int, then there's a possibility that callers will truncate it when they store it in an int. In fact this happened in practice, see commit 966d2b04e070 ("percpu-refcount: fix reference leak during percpu-atomic transition"). So add a test that the result fits in an int, even when the input doesn't. This catches the case where an implementation just passes the non-zero input value out as the result. Link: http://lkml.kernel.org/r/1499775133-1231-1-git-send-email-mpe@ellerman.id.au Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Douglas Miller <dougmill@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14mm: fix overflow check in expand_upwards()Helge Deller1-1/+1
Jörn Engel noticed that the expand_upwards() function might not return -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE. Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa which all define TASK_SIZE as 0xffffffff, but since none of those have an upwards-growing stack we currently have no actual issue. Nevertheless let's fix this just in case any of the architectures with an upward-growing stack (currently parisc, metag and partly ia64) define TASK_SIZE similar. Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit") Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Jörn Engel <joern@purestorage.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14Btrfs: fix unexpected return value of bio_readpage_errorLiu Bo2-8/+8
With blk_status_t conversion (that are now present in master), bio_readpage_error() may return 1 as now ->submit_bio_hook() may not set %ret if it runs without problems. This fixes that unexpected return value by changing btrfs_check_repairable() to return a bool instead of updating %ret, and patch is applicable to both codebases with and without blk_status_t. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14btrfs: btrfs_create_repair_bio never fails, skip error handlingDavid Sterba2-8/+0
As the function uses the non-failing bio allocation, we can remove error handling from the callers as well. Signed-off-by: David Sterba <dsterba@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14btrfs: cloned bios must not be iterated by bio_for_each_segment_allDavid Sterba4-0/+7
We've started using cloned bios more in 4.13, there are some specifics regarding the iteration. Filipe found [1] that the raid56 iterated a cloned bio using bio_for_each_segment_all, which is incorrect. The cloned bios have wrong bi_vcnt and this could lead to silent corruptions. This patch adds assertions to all remaining bio_for_each_segment_all cases. [1] https://patchwork.kernel.org/patch/9838535/ Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-07-14kvm: x86: hyperv: make VP_INDEX managed by userspaceRoman Kagan6-19/+50
Hyper-V identifies vCPUs by Virtual Processor Index, which can be queried via HV_X64_MSR_VP_INDEX msr. It is defined by the spec as a sequential number which can't exceed the maximum number of vCPUs per VM. APIC ids can be sparse and thus aren't a valid replacement for VP indices. Current KVM uses its internal vcpu index as VP_INDEX. However, to make it predictable and persistent across VM migrations, the userspace has to control the value of VP_INDEX. This patch achieves that, by storing vp_index explicitly on vcpu, and allowing HV_X64_MSR_VP_INDEX to be set from the host side. For compatibility it's initialized to KVM vcpu index. Also a few variables are renamed to make clear distinction betweed this Hyper-V vp_index and KVM vcpu_id (== APIC id). Besides, a new capability, KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip attempting msr writes where unsupported, to avoid spamming error logs. Signed-off-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14KVM: async_pf: Let guest support delivery of async_pf from guest modeWanpeng Li7-7/+16
Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1, async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0, kvm_can_do_async_pf returns 0 if in guest mode. This is similar to what svm.c wanted to do all along, but it is only enabled for Linux as L1 hypervisor. Foreign hypervisors must never receive async page faults as vmexits, because they'd probably be very confused about that. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14KVM: async_pf: Force a nested vmexit if the injected #PF is async_pfWanpeng Li5-10/+35
Add an nested_apf field to vcpu->arch.exception to identify an async page fault, and constructs the expected vm-exit information fields. Force a nested VM exit from nested_vmx_check_exception() if the injected #PF is async page fault. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14KVM: async_pf: Add L1 guest async_pf #PF vmexit handlerWanpeng Li5-37/+51
This patch adds the L1 guest async page fault #PF vmexit handler, such by L1 similar to ordinary async page fault. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Passed insn parameters to kvm_mmu_page_fault().] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14KVM: x86: Simplify kvm_x86_ops->queue_exception parameter listWanpeng Li4-13/+12
This patch removes all arguments except the first in kvm_x86_ops->queue_exception since they can extract the arguments from vcpu->arch.exception themselves. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-07-14KEYS: Add documentation for asymmetric keyring restrictionsMat Martineau2-8/+63
Provide more specific examples of keyring restrictions as applied to X.509 signature chain verification. Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-07-14KEYS: DH: validate __spare fieldEric Biggers2-0/+7
Syscalls must validate that their reserved arguments are zero and return EINVAL otherwise. Otherwise, it will be impossible to actually use them for anything in the future because existing programs may be passing garbage in. This is standard practice when adding new APIs. Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-07-14modsign: add markers to endif-statements in certs/MakefileJarkko Sakkinen1-3/+3
It's a bit hard for eye to track certs/Makefile if you are not accustomed to it. This commit adds comments to key endif statements in order to help to keep the context while reading this file. Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2017-07-13vfs: in iomap seek_{hole,data}, return -ENXIO for negative offsetsDarrick J. Wong1-4/+4
In the iomap implementations of SEEK_HOLE and SEEK_DATA, make sure we return -ENXIO for negative offsets. Inspired-by: Mateusz S <muttdini@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13Revert "xfs: grab dquots without taking the ilock"Christoph Hellwig2-12/+4
This reverts commit 50e0bdbe9f48f98bb02eac7030d682f4716884ae. The new XFS_QMOPT_NOLOCK isn't used at all, and conditional locking based on a flag is always the wrong thing to do - we should be having helpers that can be called without the lock instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13xfs: assert locking precondition in xfs_readlink_bmap_ilockedChristoph Hellwig1-0/+2
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13xfs: assert locking precondіtion in xfs_attr_list_int_ilockedChristoph Hellwig1-0/+2
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13xfs: fixup xfs_attr_get_ilockedChristoph Hellwig1-1/+3
The comment mentioned the wrong lock. Also add an ASSERT to assert this locking precondition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13NFS: Don't run wake_up_bit() when nobody is waiting...Trond Myklebust2-1/+18
"perf lock" shows fairly heavy contention for the bit waitqueue locks when doing an I/O heavy workload. Use a bit to tell whether or not there has been contention for a lock so that we can optimise away the bit waitqueue options in those cases. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13nfs: add export operationsPeng Tao4-1/+182
This support for opening files on NFS by file handle, both through the open_by_handle syscall, and for re-exporting NFS (for example using a different version). The support is very basic for now, as each open by handle will have to do an NFSv4 open operation on the wire. In the future this will hopefully be mitigated by an open file cache, as well as various optimizations in NFS for this specific case. Signed-off-by: Peng Tao <tao.peng@primarydata.com> [hch: incorporated various changes, resplit the patches, new changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13nfs4: add NFSv4 LOOKUPP handlersJeff Layton5-1/+170
This will be needed in order to implement the get_parent export op for nfsd. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13nfs: add a nfs_ilookup helperPeng Tao2-0/+23
This helper will allow to find an existing NFS inode by the file handle and fattr. Signed-off-by: Peng Tao <tao.peng@primarydata.com> [hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13nfs: replace d_add with d_splice_alias in atomic_openPeng Tao1-1/+1
It's a trival change but follows knfsd export document that asks for d_splice_alias during lookup. Signed-off-by: Peng Tao <tao.peng@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13sunrpc: use constant time memory comparison for macJason A. Donenfeld1-1/+2
Otherwise, we enable a MAC forgery via timing attack. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@poochiereds.net> Cc: Trond Myklebust <trond.myklebust@primarydata.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: linux-nfs@vger.kernel.org Cc: stable@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13NFSv4.2 fix size storage for nfs42_proc_copyOlga Kornievskaia1-1/+1
Return size of COPY is u64 but it was assigned to an "int" status. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Fix documenting comments in frwr_ops.cChuck Lever1-3/+3
Clean up. FASTREG and LOCAL_INV WRs are typically not signaled. localinv_wake is used for the last LOCAL_INV WR in a chain, which is always signaled. The documenting comments should reflect that. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Replace PAGE_MASK with offset_in_page()Chuck Lever1-8/+8
Clean up. Reported by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: FMR does not need list_del_init()Chuck Lever1-8/+10
Clean up. Commit 38f1932e60ba ("xprtrdma: Remove FMRs from the unmap list after unmapping") utilized list_del_init() to try to prevent some list corruption. The corruption was actually caused by the reply handler racing with a signal. Now that MR invalidation is properly serialized, list_del_init() can safely be replaced. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Demote "connect" log messagesChuck Lever1-36/+8
Some have complained about the log messages generated when xprtrdma opens or closes a connection to a server. When an NFS mount is mostly idle these can appear every few minutes as the client idles out the connection and reconnects. Connection and disconnection is a normal part of operation, and not exceptional, so change these to dprintk's for now. At some point all of these will be converted to tracepoints, but that's for another day. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13NFSv4.1: Use seqid returned by EXCHANGE_ID after state migrationChuck Lever1-4/+3
Transparent State Migration copies a client's lease state from the server where a filesystem used to reside to the server where it now resides. When an NFSv4.1 client first contacts that destination server, it uses EXCHANGE_ID to detect trunking relationships. The lease that was copied there is returned to that client, but the destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to the client. This is because the lease was confirmed on the source server (before it was copied). When CONFIRMED_R is set, the client throws away the sequence ID returned by the server. During a Transparent State Migration, however there's no other way for the client to know what sequence ID to use with a lease that's been migrated. Therefore, the client must save and use the contrived slot sequence value returned by the destination server even when CONFIRMED_R is set. Note that some servers always return a seqid of 1 after a migration. Reported-by: Xuan Qi <xuan.qi@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Xuan Qi <xuan.qi@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13NFSv4.1: Handle EXCHGID4_FLAG_CONFIRMED_R during NFSv4.1 migrationChuck Lever3-5/+18
Transparent State Migration copies a client's lease state from the server where a filesystem used to reside to the server where it now resides. When an NFSv4.1 client first contacts that destination server, it uses EXCHANGE_ID to detect trunking relationships. The lease that was copied there is returned to that client, but the destination server sets EXCHGID4_FLAG_CONFIRMED_R when replying to the client. This is because the lease was confirmed on the source server (before it was copied). Normally, when CONFIRMED_R is set, a client purges the lease and creates a new one. However, that throws away the entire benefit of Transparent State Migration. Therefore, the client must not purge that lease when it is possible that Transparent State Migration has occurred. Reported-by: Xuan Qi <xuan.qi@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Xuan Qi <xuan.qi@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Don't defer MR recovery if ro_map failsChuck Lever2-20/+17
Deferred MR recovery does a DMA-unmapping of the MW. However, ro_map invokes rpcrdma_defer_mr_recovery in some error cases where the MW has not even been DMA-mapped yet. Avoid a DMA-unmapping error replacing rpcrdma_defer_mr_recovery. Also note that if ib_dma_map_sg is asked to map 0 nents, it will return 0. So the extra "if (i == 0)" check is no longer needed. Fixes: 42fe28f60763 ("xprtrdma: Do not leak an MW during a DMA ...") Fixes: 505bbe64dd04 ("xprtrdma: Refactor MR recovery work queues") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Fix FRWR invalidation error recoveryChuck Lever1-10/+13
When ib_post_send() fails, all LOCAL_INV WRs past @bad_wr have to be examined, and the MRs reset by hand. I'm not sure how the existing code can work by comparing R_keys. Restructure the logic so that instead it walks the chain of WRs, starting from the first bad one. Make sure to wait for completion if at least one WR was actually posted. Otherwise, if the ib_post_send fails, we can end up DMA-unmapping the MR while LOCAL_INV operations are in flight. Commit 7a89f9c626e3 ("xprtrdma: Honor ->send_request API contract") added the rdma_disconnect() call site. The disconnect actually causes more problems than it solves, and SQ overruns happen only as a result of software bugs. So remove it. Fixes: d7a21c1bed54 ("xprtrdma: Reset MRs in frwr_op_unmap_sync()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Fix client lock-up after application signal firesChuck Lever4-31/+84
After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom- halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Rename rpcrdma_req::rl_freeChuck Lever2-6/+5
Clean up: I'm about to use the rl_free field for purposes other than a free list. So use a more generic name. This is a refactoring change only. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Pass only the list of registered MRs to ro_unmap_syncChuck Lever4-27/+26
There are rare cases where an rpcrdma_req can be re-used (via rpcrdma_buffer_put) while the RPC reply handler is still running. This is due to a signal firing at just the wrong instant. Since commit 9d6b04097882 ("xprtrdma: Place registered MWs on a per-req list"), rpcrdma_mws are self-contained; ie., they fully describe an MR and scatterlist, and no part of that information is stored in struct rpcrdma_req. As part of closing the above race window, pass only the req's list of registered MRs to ro_unmap_sync, rather than the rpcrdma_req itself. Some extra transport header sanity checking is removed. Since the client depends on its own recollection of what memory had been registered, there doesn't seem to be a way to abuse this change. And, the check was not terribly effective. If the client had sent Read chunks, the "list_empty" test is negative in both of the removed cases, which are actually looking for Write or Reply chunks. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: Pre-mark remotely invalidated MRsChuck Lever4-5/+28
There are rare cases where an rpcrdma_req and its matched rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC reply handler is still using that req. This is typically due to a signal firing at just the wrong instant. As part of closing this race window, avoid using the wrong rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as invalidated while we are sure the rep is still OK to use. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649a725 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13xprtrdma: On invalidation failure, remove MWs from rl_registeredChuck Lever1-0/+1
Callers assume the ro_unmap_sync and ro_unmap_safe methods empty the list of registered MRs. Ensure that all paths through fmr_op_unmap_sync() remove MWs from that list. Fixes: 9d6b04097882 ("xprtrdma: Place registered MWs on a ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-07-13NFS: check for nfs_refresh_inode() errors in nfs_fhget()NeilBrown1-2/+8
If an NFS server returns a filehandle that we have previously seen, and reports a different type, then nfs_refresh_inode() will log a warning and return an error. nfs_fhget() does not check for this error and may return an inode with a different type than the one that the server reported. This is likely to cause confusion, and is one way that ->open_context() could return a directory inode as discussed in the previous patch. So if nfs_refresh_inode() returns and error, return that error from nfs_fhget() to avoid the confusion propagating. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>