aboutsummaryrefslogtreecommitdiffstats
path: root/fs/openpromfs (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2015-11-26arm64: efi: fix initcall return valuesArd Biesheuvel1-5/+5
Even though initcall return values are typically ignored, the prototype is to return 0 on success or a negative errno value on error. So fix the arm_enable_runtime_services() implementation to return 0 on conditions that are not in fact errors, and return a meaningful error code otherwise. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26arm64: efi: deal with NULL return value of early_memremap()Ard Biesheuvel1-1/+13
Add NULL return value checks to two invocations of early_memremap() in the UEFI init code. For the UEFI configuration tables, we just warn since we have a better chance of being able to report the issue in a way that can actually be noticed by a human operator if we don't abort right away. For the UEFI memory map, however, all we can do is panic() since we cannot proceed without a description of memory. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26arm64: debug: Treat the BRPs/WRPs as unsignedSuzuki K. Poulose1-2/+4
IDAA64DFR0_EL1: BRPs and WRPs are unsigned values. Use the appropriate helpers to extract those fields. Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26arm64: cpufeature: Track unsigned fieldsSuzuki K. Poulose2-16/+31
Some of the feature bits have unsigned values and need to be treated accordingly to avoid errors. Adds the property to the feature bits and use the appropriate field extract helpers. Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26xen/events: Always allocate legacy interrupts on PV guestsBoris Ostrovsky3-2/+13
After commit 8c058b0b9c34 ("x86/irq: Probe for PIC presence before allocating descs for legacy IRQs") early_irq_init() will no longer preallocate descriptors for legacy interrupts if PIC does not exist, which is the case for Xen PV guests. Therefore we may need to allocate those descriptors ourselves. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-11-26arm64: cpufeature: Add helpers for extracting unsigned valuesSuzuki K. Poulose1-0/+12
The cpuid_feature_extract_field() extracts the feature value as a signed integer. This could be problematic for features whose values are unsigned. e.g, ID_AA64DFR0_EL1:BRPs. Add an unsigned variant for the unsigned fields. Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reported-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26xen/gntdev: Grant maps should not be subject to NUMA balancingBoris Ostrovsky1-1/+1
Doing so will cause the grant to be unmapped and then, during fault handling, the fault to be mistakenly treated as NUMA hint fault. In addition, even if those maps could partcipate in NUMA balancing, it wouldn't provide any benefit since we are unable to determine physical page's node (even if/when VNUMA is implemented). Marking grant maps' VMAs as VM_IO will exclude them from being part of NUMA balancing. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-11-26Revert "arm64: Mark kernel page ranges contiguous"Catalin Marinas1-61/+8
This reverts commit 348a65cdcbbf243073ee39d1f7d4413081ad7eab. Incorrect page table manipulation that does not respect the ARM ARM recommended break-before-make sequence may lead to TLB conflicts. The contiguous PTE patch makes the system even more susceptible to such errors by changing the mapping from a single page to a contiguous range of pages. An additional TLB invalidation would reduce the risk window, however, the correct fix is to switch to a temporary swapper_pg_dir. Once the correct workaround is done, the reverted commit will be re-applied. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Jeremy Linton <jeremy.linton@arm.com>
2015-11-26arm64: mm: keep reserved ASIDs in sync with mm after multiple rolloversWill Deacon1-12/+26
Under some unusual context-switching patterns, it is possible to end up with multiple threads from the same mm running concurrently with different ASIDs: 1. CPU x schedules task t with mm p containing ASID a and generation g This task doesn't block and the CPU doesn't context switch. So: * per_cpu(active_asid, x) = {g,a} * p->context.id = {g,a} 2. Some other CPU generates an ASID rollover. The global generation is now (g + 1). CPU x is still running t, with no context switch and so per_cpu(reserved_asid, x) = {g,a} 3. CPU y schedules task t', which shares mm p with t. The generation mismatches, so we take the slowpath and hit the reserved ASID from CPU x. p is then updated so that p->context.id = {g + 1,a} 4. CPU y schedules some other task u, which has an mm != p. 5. Some other CPU generates *another* CPU rollover. The global generation is now (g + 2). CPU x is still running t, with no context switch and so per_cpu(reserved_asid, x) = {g,a}. 6. CPU y once again schedules task t', but now *fails* to hit the reserved ASID from CPU x because of the generation mismatch. This results in a new ASID being allocated, despite the fact that t is still running on CPU x with the same mm. Consequently, TLBIs (e.g. as a result of CoW) will not be synchronised between the two threads. This patch fixes the problem by updating all of the matching reserved ASIDs when we hit on the slowpath (i.e. in step 3 above). This keeps the reserved ASIDs in-sync with the mm and avoids the problem. Reported-by: Tony Thompson <anthony.thompson@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26arm64: KASAN depends on !(ARM64_16K_PAGES && ARM64_VA_BITS_48)Andrey Ryabinin1-1/+1
On KASAN + 16K_PAGES + 48BIT_VA arch/arm64/mm/kasan_init.c: In function ‘kasan_early_init’: include/linux/compiler.h:484:38: error: call to ‘__compiletime_assert_95’ declared with attribute error: BUILD_BUG_ON failed: !IS_ALIGNED(KASAN_SHADOW_END, PGDIR_SIZE) _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) Currently KASAN will not work on 16K_PAGES and 48BIT_VA, so forbid such configuration to avoid above build failure. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Suzuki K. Poulose <Suzuki.Poulose@arm.com> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-26nios2: fix cache coherencyLey Foon Tan1-20/+4
There is intermittent cache coherency issue caught in toolchian tests. Revert to use flushd. Signed-off-by: Ley Foon Tan <lftan@altera.com>
2015-11-25intel_pstate: Fix "performance" mode behavior with HWP enabledAlexandra Yates1-0/+2
If hardware-driven P-state selection (HWP) is enabled, the "performance" mode of intel_pstate should only allow the processor to use the highest-performance P-state available. That is not the case currently, so make it actually happen. Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Alexandra Yates <alexandra.yates@linux.intel.com> [ rjw: Subject and changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-11-25nfs4: resend LAYOUTGET when there is a race that changes the seqidJeff Layton1-25/+31
pnfs_layout_process will check the returned layout stateid against what the kernel has in-core. If it turns out that the stateid we received is older, then we should resend the LAYOUTGET instead of falling back to MDS I/O. Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org # 3.18+ Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-25nfs: if we have no valid attrs, then don't declare the attribute cache validJeff Layton1-1/+5
If we pass in an empty nfs_fattr struct to nfs_update_inode, it will (correctly) not update any of the attributes, but it then clears the NFS_INO_INVALID_ATTR flag, which indicates that the attributes are up to date. Don't clear the flag if the fattr struct has no valid attrs to apply. Reviewed-by: Steve French <steve.french@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-25nfs: ensure that attrcache is revalidated after a SETATTRJeff Layton1-1/+4
If we get no post-op attributes back from a SETATTR operation, then no attributes will of course be updated during the call to nfs_update_inode. We know however that the attributes are invalid at that point, since we just changed some of them. At the very least, the ctime will be bogus. If we get no post-op attributes back on the call, mark the attrcache invalid to reflect that fact. Reviewed-by: Steve French <steve.french@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-11-25ARM: OMAP4+: SMP: use lockless clkdm/pwrdm api in omap4_boot_secondaryGrygorii Strashko1-3/+3
OMAP CPU hotplug uses cpu1's clocks and power domains for CPU1 wake up from low power states (or turn on CPU1). This part of code is also part of system suspend (disable_nonboot_cpus()). >From other side, cpu1's clocks and power domains are used by CPUIdle. All above functionality is mutually exclusive and, therefore, lockless clkdm/pwrdm api can be used in omap4_boot_secondary(). This fixes below back-trace on -RT which is triggered by pwrdm_lock/unlock(): BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917 in_atomic(): 1, irqs_disabled(): 0, pid: 118, name: sh 9 locks held by sh/118: #0: (sb_writers#4){.+.+.+}, at: [<c0144a6c>] vfs_write+0x13c/0x164 #1: (&of->mutex){+.+.+.}, at: [<c01b4c70>] kernfs_fop_write+0x48/0x19c #2: (s_active#24){.+.+.+}, at: [<c01b4c78>] kernfs_fop_write+0x50/0x19c #3: (device_hotplug_lock){+.+.+.}, at: [<c03cbff0>] lock_device_hotplug_sysfs+0xc/0x4c #4: (&dev->mutex){......}, at: [<c03cd284>] device_online+0x14/0x88 #5: (cpu_add_remove_lock){+.+.+.}, at: [<c003af90>] cpu_up+0x50/0x1a0 #6: (cpu_hotplug.lock){++++++}, at: [<c003ae48>] cpu_hotplug_begin+0x0/0xc4 #7: (cpu_hotplug.lock#2){+.+.+.}, at: [<c003aec0>] cpu_hotplug_begin+0x78/0xc4 #8: (boot_lock){+.+...}, at: [<c002b254>] omap4_boot_secondary+0x1c/0x178 Preemption disabled at:[< (null)>] (null) CPU: 0 PID: 118 Comm: sh Not tainted 4.1.12-rt11-01998-gb4a62c3-dirty #137 Hardware name: Generic DRA74X (Flattened Device Tree) [<c0017574>] (unwind_backtrace) from [<c0013be8>] (show_stack+0x10/0x14) [<c0013be8>] (show_stack) from [<c05a8670>] (dump_stack+0x80/0x94) [<c05a8670>] (dump_stack) from [<c05ad158>] (rt_spin_lock+0x24/0x54) [<c05ad158>] (rt_spin_lock) from [<c0030dac>] (clkdm_wakeup+0x10/0x2c) [<c0030dac>] (clkdm_wakeup) from [<c002b2c0>] (omap4_boot_secondary+0x88/0x178) [<c002b2c0>] (omap4_boot_secondary) from [<c0015d00>] (__cpu_up+0xc4/0x164) [<c0015d00>] (__cpu_up) from [<c003b09c>] (cpu_up+0x15c/0x1a0) [<c003b09c>] (cpu_up) from [<c03cd2d4>] (device_online+0x64/0x88) [<c03cd2d4>] (device_online) from [<c03cd360>] (online_store+0x68/0x74) [<c03cd360>] (online_store) from [<c01b4ce0>] (kernfs_fop_write+0xb8/0x19c) [<c01b4ce0>] (kernfs_fop_write) from [<c0144124>] (__vfs_write+0x20/0xd8) [<c0144124>] (__vfs_write) from [<c01449c0>] (vfs_write+0x90/0x164) [<c01449c0>] (vfs_write) from [<c01451e4>] (SyS_write+0x44/0x9c) [<c01451e4>] (SyS_write) from [<c0010240>] (ret_fast_syscall+0x0/0x54) CPU1: smp_ops.cpu_die() returned, trying to resuscitate Cc: Tero Kristo <t-kristo@ti.com> Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com>
2015-11-25arm: omap2+: add missing HWMOD_NO_IDLEST in 81xx hwmod dataNeil Armstrong1-0/+3
Add missing HWMOD_NO_IDLEST hwmod flag for entries not having omap4 clkctrl values. The emac0 hwmod flag fixes the davinci_emac driver probe since the return of pm_resume() call is now checked. This solves the following boot errors : [ 0.121429] omap_hwmod: l4_ls: _wait_target_ready failed: -16 [ 0.121441] omap_hwmod: l4_ls: cannot be enabled for reset (3) [ 0.124342] omap_hwmod: l4_hs: _wait_target_ready failed: -16 [ 0.124352] omap_hwmod: l4_hs: cannot be enabled for reset (3) [ 1.967228] omap_hwmod: emac0: _wait_target_ready failed: -16 Cc: Brian Hutchinson <b.hutchman@gmail.com> Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Signed-off-by: Tony Lindgren <tony@atomide.com>
2015-11-25Revert "blk-flush: Queue through IO scheduler when flush not required"Jens Axboe1-1/+1
This reverts commit 1b2ff19e6a957b1ef0f365ad331b608af80e932e. Jan writes: -- Thanks for report! After some investigation I found out we allocate elevator specific data in __get_request() only for non-flush requests. And this is actually required since the flush machinery uses the space in struct request for something else. Doh. So my patch is just wrong and not easy to fix since at the time __get_request() is called we are not sure whether the flush machinery will be used in the end. Jens, please revert 1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks! I'm somewhat surprised that you can reliably hit the race where flushing gets disabled for the device just while the request is in flight. But I guess during boot it makes some sense. -- So let's just revert it, we can fix the queue run manually after the fact. This race is rare enough that it didn't trigger in testing, it requires the specific disable-while-in-flight scenario to trigger.
2015-11-25arm64: efi: correctly map runtime regionsMark Rutland1-7/+2
The kernel may use a page granularity of 4K, 16K, or 64K depending on configuration. When mapping EFI runtime regions, we use memrange_efi_to_native to round the physical base address of a region down to a kernel page boundary, and round the size up to a kernel page boundary, adding the residue left over from rounding down the physical base address. We do not round down the virtual base address. In __create_mapping we account for the offset of the virtual base from a granule boundary, adding the residue to the size before rounding the base down to said granule boundary. Thus we account for the residue twice, and when the residue is non-zero will cause __create_mapping to map an additional page at the end of the region. Depending on the memory map, this page may be in a region we are not intended/permitted to map, or may clash with a different region that we wish to map. In typical cases, mapping the next item in the memory map will overwrite the erroneously created entry, as we sort the memory map in the stub. As __create_mapping can cope with base addresses which are not page aligned, we can instead rely on it to map the region appropriately, and simplify efi_virtmap_init by removing the unnecessary code. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Leif Lindholm <leif.lindholm@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-25arm64: mm: fix fault_info table xFSC decodingMark Rutland1-14/+14
We are missing descriptions for some valid xFSC values in the fault info table (e.g. "TLB conflict abort"), and have erroneous descriptions for reserved values (e.g. "asynchronous external abort", "debug event"). This patch adds the missing xFSC values, and removes erroneous decoding of values reserved by the architecture, as described in ARM DDI 0487A.h. At the same time, fixed the unbalanced brackets for the synchronous parity error strings in the table. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-25arm64: fix building without CONFIG_UID16Arnd Bergmann2-2/+2
As reported by Michal Simek, building an ARM64 kernel with CONFIG_UID16 disabled currently fails because the system call table still needs to reference the individual function entry points that are provided by kernel/sys_ni.c in this case, and the declarations are hidden inside of #ifdef CONFIG_UID16: arch/arm64/include/asm/unistd32.h:57:8: error: 'sys_lchown16' undeclared here (not in a function) __SYSCALL(__NR_lchown, sys_lchown16) I believe this problem only exists on ARM64, because older architectures tend to not need declarations when their system call table is built in assembly code, while newer architectures tend to not need UID16 support. ARM64 only uses these system calls for compatibility with 32-bit ARM binaries. This changes the CONFIG_UID16 check into CONFIG_HAVE_UID16, which is set unconditionally on ARM64 with CONFIG_COMPAT, so we see the declarations whenever we need them, but otherwise the behavior is unchanged. Fixes: af1839eb4bd4 ("Kconfig: clean up the long arch list for the UID16 config option") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Will Deacon <will.deacon@arm.com> Cc: stable@vger.kernel.org Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-25ARM: orion5x: Fix legacy get_irqnr_and_baseNicolas Pitre1-1/+1
Commit 5be9fc23cd ("ARM: orion5x: fix legacy orion5x IRQ numbers") shifted IRQ numbers by one but didn't update the get_irqnr_and_base macro accordingly. This macro is involved when CONFIG_MULTI_IRQ_HANDLER is not defined. [jac: 5d6bed2a9c went in to v4.2, but was backported to v3.18] Signed-off-by: Nicolas Pitre <nico@linaro.org> Fixes: 5be9fc23cd ("ARM: orion5x: fix legacy orion5x IRQ numbers") Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Jason Cooper <jason@lakedaemon.net>
2015-11-25ARM: dove: Fix legacy get_irqnr_and_baseNicolas Pitre1-2/+2
Commit 5d6bed2a9c ("ARM: dove: fix legacy dove IRQ numbers") shifted IRQ numbers by one but didn't update the get_irqnr_and_base macro accordingly. This macro is involved when CONFIG_MULTI_IRQ_HANDLER is not defined. [jac: 5d6bed2a9c went in to v4.2, but was backported to v3.18] Signed-off-by: Nicolas Pitre <nico@linaro.org> Fixes: 5d6bed2a9c ("ARM: dove: fix legacy dove IRQ numbers") Cc: <stable@vger.kernel.org> # v3.18+ Signed-off-by: Jason Cooper <jason@lakedaemon.net>
2015-11-25KVM: nVMX: remove incorrect vpid check in nested invvpid emulationHaozhong Zhang1-5/+0
This patch removes the vpid check when emulating nested invvpid instruction of type all-contexts invalidation. The existing code is incorrect because: (1) According to Intel SDM Vol 3, Section "INVVPID - Invalidate Translations Based on VPID", invvpid instruction does not check vpid in the invvpid descriptor when its type is all-contexts invalidation. (2) According to the same document, invvpid of type all-contexts invalidation does not require there is an active VMCS, so/and get_vmcs12() in the existing code may result in a NULL-pointer dereference. In practice, it can crash both KVM itself and L1 hypervisors that use invvpid (e.g. Xen). Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25btrfs: fix balance range usage filters in 4.4-rcHolger Hoffstätte1-2/+2
There's a regression in 4.4-rc since commit bc3094673f22 (btrfs: extend balance filter usage to take minimum and maximum) in that existing (non-ranged) balance with -dusage=x no longer works; all chunks are skipped. After staring at the code for a while and wondering why a non-ranged balance would even need min and max thresholds (..which then were not set correctly, leading to the bug) I realized that the only problem was the fact that the filter functions were named wrong, thanks to patching copypasta. Simply renaming both functions lets the existing btrfs-progs call balance with -dusage=x and now the non-ranged filter function is invoked, properly using only a single chunk limit. Signed-off-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: qgroup: account shared subtree during snapshot deleteMark Fasheh2-7/+42
Commit 0ed4792 ('btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.') removed our qgroup accounting during btrfs_drop_snapshot(). Predictably, this results in qgroup numbers going bad shortly after a snapshot is removed. Fix this by adding a dirty extent record when we encounter extents during our shared subtree walk. This effectively restores the functionality we had with the original shared subtree walking code in 1152651 (btrfs: qgroup: account shared subtrees during snapshot delete). The idea with the original patch (and this one) is that shared subtrees can get skipped during drop_snapshot. The shared subtree walk then allows us a chance to visit those extents and add them to the qgroup work for later processing. This ultimately makes the accounting for drop snapshot work. The new qgroup code nicely handles all the other extents during the tree walk via the ref dec/inc functions so we don't have to add actions beyond what we had originally. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: use btrfs_get_fs_root in resolve_indirect_refJosef Bacik1-1/+1
The backref code will look up the fs_root we're trying to resolve our indirect refs for, unfortunately we use btrfs_read_fs_root_no_name, which returns -ENOENT if the ref is 0. This isn't helpful for the qgroup stuff with snapshot delete as it won't be able to search down the snapshot we are deleting, which will cause us to miss roots. So use btrfs_get_fs_root and send false for check_ref so we can always get the root we're looking for. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: qgroup: fix quota disable during rescanJustin Maggard1-1/+2
There's a race condition that leads to a NULL pointer dereference if you disable quotas while a quota rescan is running. To fix this, we just need to wait for the quota rescan worker to actually exit before tearing down the quota structures. Signed-off-by: Justin Maggard <jmaggard@netgear.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix race between cleaner kthread and space cache writeoutFilipe Manana1-13/+16
When a block group becomes unused and the cleaner kthread is currently running, we can end up getting the current transaction aborted with error -ENOENT when we try to commit the transaction, leading to the following trace: [59779.258768] WARNING: CPU: 3 PID: 5990 at fs/btrfs/extent-tree.c:3740 btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs]() [59779.272594] BTRFS: Transaction aborted (error -2) (...) [59779.291137] Call Trace: [59779.291621] [<ffffffff812566f4>] dump_stack+0x4e/0x79 [59779.292543] [<ffffffff8104d0a6>] warn_slowpath_common+0x9f/0xb8 [59779.293435] [<ffffffffa04cb81f>] ? btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.295000] [<ffffffff8104d107>] warn_slowpath_fmt+0x48/0x50 [59779.296138] [<ffffffffa04c2721>] ? write_one_cache_group.isra.32+0x77/0x82 [btrfs] [59779.297663] [<ffffffffa04cb81f>] btrfs_write_dirty_block_groups+0x17c/0x214 [btrfs] [59779.299141] [<ffffffffa0549b0d>] commit_cowonly_roots+0x1de/0x261 [btrfs] [59779.300359] [<ffffffffa04dd5b6>] btrfs_commit_transaction+0x4c4/0x99c [btrfs] [59779.301805] [<ffffffffa04b5df4>] btrfs_sync_fs+0x145/0x1ad [btrfs] [59779.302893] [<ffffffff81196634>] sync_filesystem+0x7f/0x93 (...) [59779.318186] ---[ end trace 577e2daff90da33a ]--- The following diagram illustrates a sequence of steps leading to this problem: CPU 1 CPU 2 <at transaction N> adds bg A to list fs_info->unused_bgs adds bg B to list fs_info->unused_bgs <transaction kthread commits transaction N and wakes up the cleaner kthread> cleaner kthread delete_unused_bgs() sees bg A in list fs_info->unused_bgs btrfs_start_transaction() <transaction N + 1 starts> deletes bg A update_block_group(bg C) --> adds bg C to list fs_info->unused_bgs deletes bg B sees bg C in the list fs_info->unused_bgs btrfs_remove_chunk(bg C) btrfs_remove_block_group(bg C) --> checks if the block group is in a dirty list, and because it isn't now, it does nothing --> the block group item is deleted from the extent tree --> adds bg C to list transaction->dirty_bgs some task calls btrfs_commit_transaction(t N + 1) commit_cowonly_roots() btrfs_write_dirty_block_groups() --> sees bg C in cur_trans->dirty_bgs --> calls write_one_cache_group() which returns -ENOENT because it did not find the block group item in the extent tree --> transaction aborte with -ENOENT because write_one_cache_group() returned that error So fix this by adding a block group to the list of dirty block groups before adding it to the list of unused block groups. This happened on a stress test using fsstress plus concurrent calls to fallocate 20G and truncate (releasing part of the space allocated with fallocate). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix scrub preventing unused block groups from being deletedFilipe Manana3-1/+24
Currently scrub can race with the cleaner kthread when the later attempts to delete an unused block group, and the result is preventing the cleaner kthread from ever deleting later the block group - unless the block group becomes used and unused again. The following diagram illustrates that race: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs and removes it from that list scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO sees the block group is already RO and therefore doesn't delete it nor adds it back to unused list So fix this by making scrub add the block group again to the list of unused block groups if the block group is still unused when it finished scrubbing it and it hasn't been removed already. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix race between scrub and block group deletionFilipe Manana1-4/+16
Scrub can race with the cleaner kthread deleting block groups that are unused (and with relocation too) leading to a failure with error -EINVAL that gets returned to user space. The following diagram illustrates how it happens: CPU 1 CPU 2 cleaner kthread btrfs_delete_unused_bgs() gets block group X from fs_info->unused_bgs sets block group to RO btrfs_remove_chunk(bg X) deletes device extents scrub_enumerate_chunks() searches device tree using its commit root finds device extent for block group X gets block group X from the tree fs_info->block_group_cache_tree (via btrfs_lookup_block_group()) sets bg X to RO (again) btrfs_remove_block_group(bg X) deletes block group from fs_info->block_group_cache_tree removes extent map from fs_info->mapping_tree scrub_chunk(offset X) searches fs_info->mapping_tree for extent map starting at offset X --> doesn't find any such extent map --> returns -EINVAL and scrub errors out to userspace with -EINVAL Fix this by dealing with an extent map lookup failure as an indicator of block group deletion. Issue reproduced with fstest btrfs/071. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: fix rcu warning during device replaceDavid Sterba1-4/+2
The test btrfs/011 triggers a rcu warning Reviewed-by: Anand Jain <anand.jain@oracle.com> =============================== [ INFO: suspicious RCU usage. ] 4.4.0-rc1-default+ #286 Tainted: G W ------------------------------- fs/btrfs/volumes.c:1977 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 4 locks held by btrfs/28786: 0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+...}, at: [<ffffffffa00bc785>] btrfs_dev_replace_finishing+0x45/0xa00 [btrfs] 1: (uuid_mutex){+.+.+.}, at: [<ffffffffa00bc84f>] btrfs_dev_replace_finishing+0x10f/0xa00 [btrfs] 2: (&fs_devs->device_list_mutex){+.+.+.}, at: [<ffffffffa00bc868>] btrfs_dev_replace_finishing+0x128/0xa00 [btrfs] 3: (&fs_info->chunk_mutex){+.+...}, at: [<ffffffffa00bc87d>] btrfs_dev_replace_finishing+0x13d/0xa00 [btrfs] stack backtrace: CPU: 0 PID: 28786 Comm: btrfs Tainted: G W 4.4.0-rc1-default+ #286 Hardware name: Intel Corporation SandyBridge Platform/To be filled by O.E.M., BIOS ASNBCPT1.86C.0031.B00.1006301607 06/30/2010 0000000000000001 ffff8800a07dfb48 ffffffff8141d47b 0000000000000001 0000000000000001 0000000000000000 ffff8801464a4f00 ffff8800a07dfb78 ffffffff810cd883 ffff880146eb9400 ffff8800a3698600 ffff8800a33fe220 Call Trace: [<ffffffff8141d47b>] dump_stack+0x4f/0x74 [<ffffffff810cd883>] lockdep_rcu_suspicious+0x103/0x140 [<ffffffffa0071261>] btrfs_rm_dev_replace_remove_srcdev+0x111/0x130 [btrfs] [<ffffffff810d354d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff81449536>] ? __percpu_counter_sum+0x66/0x80 [<ffffffffa00bcc15>] btrfs_dev_replace_finishing+0x4d5/0xa00 [btrfs] [<ffffffffa00bc96e>] ? btrfs_dev_replace_finishing+0x22e/0xa00 [btrfs] [<ffffffffa00a8795>] ? btrfs_scrub_dev+0x415/0x6d0 [btrfs] [<ffffffffa003ea69>] ? btrfs_start_transaction+0x9/0x20 [btrfs] [<ffffffffa00bda79>] btrfs_dev_replace_start+0x339/0x590 [btrfs] [<ffffffff81196aa5>] ? __might_fault+0x95/0xa0 [<ffffffffa0078638>] btrfs_ioctl_dev_replace+0x118/0x160 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffffa007c914>] ? btrfs_ioctl+0x24/0x1770 [btrfs] [<ffffffffa007ce43>] btrfs_ioctl+0x553/0x1770 [btrfs] [<ffffffff811409c6>] ? stack_trace_call+0x46/0x70 [<ffffffff811d6eb1>] ? do_vfs_ioctl+0x21/0x5a0 [<ffffffff811d6f1c>] do_vfs_ioctl+0x8c/0x5a0 [<ffffffff811e3336>] ? __fget_light+0x86/0xb0 [<ffffffff811e3369>] ? __fdget+0x9/0x20 [<ffffffff811d7451>] ? SyS_ioctl+0x21/0x80 [<ffffffff811d7483>] SyS_ioctl+0x53/0x80 [<ffffffff81b1efd7>] entry_SYSCALL_64_fastpath+0x12/0x6f This is because of unprotected use of rcu_dereference in btrfs_scratch_superblocks. We can't add rcu locks around the whole function because we read the superblock. The fix will use the rcu string buffer directly without the rcu locking. Thi is safe as the device will not go away in the meantime. We're holding the device list mutexes. Restructuring the code to narrow down the rcu section turned out to be impossible, we need to call filp_open (through update_dev_time) on the buffer and this could call kmalloc/__might_sleep. We could call kstrdup with GFP_ATOMIC but it's not absolutely necessary. Fixes: 12b1c2637b6e (Btrfs: enhance btrfs_scratch_superblock to scratch all superblocks) Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: Continue replace when set_block_ro failedZhaolei1-2/+18
xfstests/011 failed in node with small_size filesystem. Can be reproduced by following script: DEV_LIST="/dev/vdd /dev/vde" DEV_REPLACE="/dev/vdf" do_test() { local mkfs_opt="$1" local size="$2" dmesg -c >/dev/null umount $SCRATCH_MNT &>/dev/null echo mkfs.btrfs -f $mkfs_opt "${DEV_LIST[*]}" mkfs.btrfs -f $mkfs_opt "${DEV_LIST[@]}" || return 1 mount "${DEV_LIST[0]}" $SCRATCH_MNT echo -n "Writing big files" dd if=/dev/urandom of=$SCRATCH_MNT/t0 bs=1M count=1 >/dev/null 2>&1 for ((i = 1; i <= size; i++)); do echo -n . /bin/cp $SCRATCH_MNT/t0 $SCRATCH_MNT/t$i || return 1 done echo echo Start replace btrfs replace start -Bf "${DEV_LIST[0]}" "$DEV_REPLACE" $SCRATCH_MNT || { dmesg return 1 } return 0 } # Set size to value near fs size # for example, 1897 can trigger this bug in 2.6G device. # ./do_test "-d raid1 -m raid1" 1897 System will report replace fail with following warning in dmesg: [ 134.710853] BTRFS: dev_replace from /dev/vdd (devid 1) to /dev/vdf started [ 135.542390] BTRFS: btrfs_scrub_dev(/dev/vdd, 1, /dev/vdf) failed -28 [ 135.543505] ------------[ cut here ]------------ [ 135.544127] WARNING: CPU: 0 PID: 4080 at fs/btrfs/dev-replace.c:428 btrfs_dev_replace_start+0x398/0x440() [ 135.545276] Modules linked in: [ 135.545681] CPU: 0 PID: 4080 Comm: btrfs Not tainted 4.3.0 #256 [ 135.546439] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014 [ 135.547798] ffffffff81c5bfcf ffff88003cbb3d28 ffffffff817fe7b5 0000000000000000 [ 135.548774] ffff88003cbb3d60 ffffffff810a88f1 ffff88002b030000 00000000ffffffe4 [ 135.549774] ffff88003c080000 ffff88003c082588 ffff88003c28ab60 ffff88003cbb3d70 [ 135.550758] Call Trace: [ 135.551086] [<ffffffff817fe7b5>] dump_stack+0x44/0x55 [ 135.551737] [<ffffffff810a88f1>] warn_slowpath_common+0x81/0xc0 [ 135.552487] [<ffffffff810a89e5>] warn_slowpath_null+0x15/0x20 [ 135.553211] [<ffffffff81448c88>] btrfs_dev_replace_start+0x398/0x440 [ 135.554051] [<ffffffff81412c3e>] btrfs_ioctl+0x1d2e/0x25c0 [ 135.554722] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.555506] [<ffffffff8111ab36>] ? current_kernel_time64+0x56/0xa0 [ 135.556304] [<ffffffff81201e3d>] do_vfs_ioctl+0x30d/0x580 [ 135.557009] [<ffffffff8114c7ba>] ? __audit_syscall_entry+0xaa/0xf0 [ 135.557855] [<ffffffff810011d1>] ? do_audit_syscall_entry+0x61/0x70 [ 135.558669] [<ffffffff8120d1c1>] ? __fget_light+0x61/0x90 [ 135.559374] [<ffffffff81202124>] SyS_ioctl+0x74/0x80 [ 135.559987] [<ffffffff81809857>] entry_SYSCALL_64_fastpath+0x12/0x6f [ 135.560842] ---[ end trace 2a5c1fc3205abbdd ]--- Reason: When big data writen to fs, the whole free space will be allocated for data chunk. And operation as scrub need to set_block_ro(), and when there is only one metadata chunk in system(or other metadata chunks are all full), the function will try to allocate a new chunk, and failed because no space in device. Fix: When set_block_ro failed for metadata chunk, it is not a problem because scrub_lock paused commit_trancaction in same time, and metadata are always cowed, so the on-the-fly writepages will not write data into same place with scrub/replace. Let replace continue in this case is no problem. Tested by above script, and xfstests/011, plus 100 times xfstests/070. Changelog v1->v2: 1: Add detail comments in source and commit-message. 2: Add dmesg detail into commit-message. 3: Limit return value of -ENOSPC to be passed. All suggested by: Filipe Manana <fdmanana@gmail.com> Suggested-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: fix clashing number of the enhanced balance usage filterDavid Sterba1-1/+1
I've accidentally picked an already used number for the enhanced usage filter represented by BTRFS_BALANCE_ARGS_USAGE_RANGE, clashing with BTRFS_BALANCE_ARGS_CONVERT. Introduced during the development phase, no backward compatibility issues. Reported-by: Holger Hoffstätte <holger.hoffstaette@googlemail.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: bc3094673f22 ("btrfs: extend balance filter usage to take minimum and maximum") Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: fix the number of transaction units needed to remove a block groupFilipe Manana3-5/+38
We were using only 1 transaction unit when attempting to delete an unused block group but in reality we need 3 + N units, where N corresponds to the number of stripes. We were accounting only for the addition of the orphan item (for the block group's free space cache inode) but we were not accounting that we need to delete one block group item from the extent tree, one free space item from the tree of tree roots and N device extent items from the device tree. While one unit is not enough, it worked most of the time because for each single unit we are too pessimistic and assume an entire tree path, with the highest possible heigth (8), needs to be COWed with eventual node splits at every possible level in the tree, so there was usually enough reserved space for removing all the items and adding the orphan item. However after adding the orphan item, writepages() can by called by the VM subsystem against the btree inode when we are under memory pressure, which causes writeback to start for the nodes we COWed before, this forces the operation to remove the free space item to COW again some (or all of) the same nodes (in the tree of tree roots). Even without writepages() being called, we could fail with ENOSPC because these items are located in multiple trees and one of them might have a higher heigth and require node/leaf splits at many levels, exhausting all the reserved space before removing all the items and adding the orphan. In the kernel 4.0 release, commit 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group"), we attempted to fix a BUG_ON due to ENOSPC when trying to add the orphan item by making the cleaner kthread reserve one transaction unit before attempting to remove the block group, but this was not enough. We had a couple user reports still hitting the same BUG_ON after 4.0, like Stefan Priebe's report on a 4.2-rc6 kernel for example: http://www.spinics.net/lists/linux-btrfs/msg46070.html So fix this by reserving all the necessary units of metadata. Reported-by: Stefan Priebe <s.priebe@profihost.ag> Fixes: 3d84be799194 ("Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: use global reserve when deleting unused block group after ENOSPCFilipe Manana6-26/+52
It's possible to reach a state where the cleaner kthread isn't able to start a transaction to delete an unused block group due to lack of enough free metadata space and due to lack of unallocated device space to allocate a new metadata block group as well. If this happens try to use space from the global block group reserve just like we do for unlink operations, so that we don't reach a permanent state where starting a transaction for filesystem operations (file creation, renames, etc) keeps failing with -ENOSPC. Such an unfortunate state was observed on a machine where over a dozen unused data block groups existed and the cleaner kthread was failing to delete them due to ENOSPC error when attempting to start a transaction, and even running balance with a -dusage=0 filter failed with ENOSPC as well. Also unmounting and mounting again the filesystem didn't help. Allowing the cleaner kthread to use the global block reserve to delete the unused data block groups fixed the problem. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25Btrfs: tests: checking for NULL instead of IS_ERR()Dan Carpenter1-1/+3
btrfs_alloc_dummy_root() return an error pointer on failure, it never returns NULL. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25btrfs: fix signed overflows in btrfs_sync_fileDavid Sterba1-3/+7
The calculation of range length in btrfs_sync_file leads to signed overflow. This was caught by PaX gcc SIZE_OVERFLOW plugin. https://forums.grsecurity.net/viewtopic.php?f=1&t=4284 The fsync call passes 0 and LLONG_MAX, the range length does not fit to loff_t and overflows, but the value is converted to u64 so it silently works as expected. The minimal fix is a typecast to u64, switching functions to take (start, end) instead of (start, len) would be more intrusive. Coccinelle script found that there's one more opencoded calculation of the length. <smpl> @@ loff_t start, end; @@ * end - start </smpl> CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-11-25arm64: early_alloc: Fix check for allocation failureSuzuki K. Poulose1-2/+6
In early_alloc we check if the memblock_alloc failed by checking the virtual address of the result, which will never fail. This patch fixes it to check the actual result for failure. Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Suzuki K. Poulose <suzuki.poulose@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2015-11-25KEYS: Fix handling of stored error in a negatively instantiated user keyDavid Howells3-2/+10
If a user key gets negatively instantiated, an error code is cached in the payload area. A negatively instantiated key may be then be positively instantiated by updating it with valid data. However, the ->update key type method must be aware that the error code may be there. The following may be used to trigger the bug in the user key type: keyctl request2 user user "" @u keyctl add user user "a" @u which manifests itself as: BUG: unable to handle kernel paging request at 00000000ffffff8a IP: [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046 PGD 7cc30067 PUD 0 Oops: 0002 [#1] SMP Modules linked in: CPU: 3 PID: 2644 Comm: a.out Not tainted 4.3.0+ #49 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff88003ddea700 ti: ffff88003dd88000 task.ti: ffff88003dd88000 RIP: 0010:[<ffffffff810a376f>] [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 [<ffffffff810a376f>] __call_rcu.constprop.76+0x1f/0x280 kernel/rcu/tree.c:3046 RSP: 0018:ffff88003dd8bdb0 EFLAGS: 00010246 RAX: 00000000ffffff82 RBX: 0000000000000000 RCX: 0000000000000001 RDX: ffffffff81e3fe40 RSI: 0000000000000000 RDI: 00000000ffffff82 RBP: ffff88003dd8bde0 R08: ffff88007d2d2da0 R09: 0000000000000000 R10: 0000000000000000 R11: ffff88003e8073c0 R12: 00000000ffffff82 R13: ffff88003dd8be68 R14: ffff88007d027600 R15: ffff88003ddea700 FS: 0000000000b92880(0063) GS:ffff88007fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000ffffff8a CR3: 000000007cc5f000 CR4: 00000000000006e0 Stack: ffff88003dd8bdf0 ffffffff81160a8a 0000000000000000 00000000ffffff82 ffff88003dd8be68 ffff88007d027600 ffff88003dd8bdf0 ffffffff810a39e5 ffff88003dd8be20 ffffffff812a31ab ffff88007d027600 ffff88007d027620 Call Trace: [<ffffffff810a39e5>] kfree_call_rcu+0x15/0x20 kernel/rcu/tree.c:3136 [<ffffffff812a31ab>] user_update+0x8b/0xb0 security/keys/user_defined.c:129 [< inline >] __key_update security/keys/key.c:730 [<ffffffff8129e5c1>] key_create_or_update+0x291/0x440 security/keys/key.c:908 [< inline >] SYSC_add_key security/keys/keyctl.c:125 [<ffffffff8129fc21>] SyS_add_key+0x101/0x1e0 security/keys/keyctl.c:60 [<ffffffff8185f617>] entry_SYSCALL_64_fastpath+0x12/0x6a arch/x86/entry/entry_64.S:185 Note the error code (-ENOKEY) in EDX. A similar bug can be tripped by: keyctl request2 trusted user "" @u keyctl add trusted user "a" @u This should also affect encrypted keys - but that has to be correctly parameterised or it will fail with EINVAL before getting to the bit that will crashes. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com> Signed-off-by: James Morris <james.l.morris@oracle.com>
2015-11-24block: fix blk_abort_request for blk-mq driversChristoph Hellwig1-3/+5
We only added the request to the request list for the !blk-mq case, so we should only delete it in that case as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-24nvme: add missing unmaps in nvme_queue_rqChristoph Hellwig1-3/+12
When we fail various metadata related operations in nvme_queue_rq we need to unmap the data SGL. Cc: stable@vger.kernel.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-24NVMe: default to 4k device page sizeNishanth Aravamudan1-9/+6
We received a bug report recently when DDW (64-bit direct DMA on Power) is not enabled for NVMe devices. In that case, we fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs (Translation Control Entries). The NVMe device driver, though, assumes that the DMA alignment for the PRP entries will match the device's page size, and that the DMA aligment matches the kernel's page aligment. On Power, the the IOMMU page size, as mentioned above, can be 4K, while the device can have a page size of 8K, while the kernel has a page size of 64K. This eventually trips the BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple of 4K but not 8K (e.g., 0xF000). In this particular case of page sizes, we clearly want to use the IOMMU's page size in the driver. And generally, the NVMe driver in this function should be using the IOMMU's page size for the default device page size, rather than the kernel's page size. There is not currently an API to obtain the IOMMU's page size across all architectures and in the interest of a stop-gap fix to this functional issue, default the NVMe device page size to 4K, with the intent of adding such an API and implementation across all architectures in the next merge window. With the functionally equivalent v3 of this patch, our hardware test exerciser survives when using 32-bit DMA; without the patch, the kernel will BUG within a few minutes. Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-11-24pidns: fix NULL dereference in __task_pid_nr_ns()Eric Dumazet1-2/+2
I got a crash during a "perf top" session that was caused by a race in __task_pid_nr_ns() : pid_nr_ns() was inlined, but apparently compiler chose to read task->pids[type].pid twice, and the pid->level dereference crashed because we got a NULL pointer at the second read : if (pid && ns->level <= pid->level) { // CRASH Just use RCU API properly to solve this race, and not worry about "perf top" crashing hosts :( get_task_pid() can benefit from same fix. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-24ALSA: hda - Fix noise on Gigabyte Z170X moboTakashi Iwai1-0/+8
Gigabyte Z710X mobo with ALC1150 codec gets significant noises from the analog loopback routes even if their inputs are all muted. Simply kill the aamix for fixing it. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=108301 Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2015-11-24selinux: fix bug in conditional rules handlingStephen Smalley1-2/+2
commit fa1aa143ac4a ("selinux: extended permissions for ioctls") introduced a bug into the handling of conditional rules, skipping the processing entirely when the caller does not provide an extended permissions (xperms) structure. Access checks from userspace using /sys/fs/selinux/access do not include such a structure since that interface does not presently expose extended permission information. As a result, conditional rules were being ignored entirely on userspace access requests, producing denials when access was allowed by conditional rules in the policy. Fix the bug by only skipping computation of extended permissions in this situation, not the entire conditional rules processing. Reported-by: Laurent Bigonville <bigon@debian.org> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> [PM: fixed long lines in patch description] Cc: stable@vger.kernel.org # 4.3 Signed-off-by: Paul Moore <pmoore@redhat.com>
2015-11-24soc: Mediatek: Enable SCPSYS power domain driver by defaultEddie Huang1-0/+1
If enable Mediatek 8173 SoC, it should also enable power domain driver. Otherwise access clk subsystem register will fail. Signed-off-by: Eddie Huang <eddie.huang@mediatek.com> Acked-by: Matthias Brugger <matthias.bgg@gmail.com> Signed-off-by: Kevin Hilman <khilman@linaro.org>
2015-11-24arm64: kvm: report original PAR_EL1 upon panicMark Rutland1-1/+5
If we call __kvm_hyp_panic while a guest context is active, we call __restore_sysregs before acquiring the system register values for the panic, in the process throwing away the PAR_EL1 value at the point of the panic. This patch modifies __kvm_hyp_panic to stash the PAR_EL1 value prior to restoring host register values, enabling us to report the original values at the point of the panic. Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-24arm64: kvm: avoid %p in __kvm_hyp_panicMark Rutland1-1/+1
Currently __kvm_hyp_panic uses %p for values which are not pointers, such as the ESR value. This can confusingly lead to "(null)" being printed for the value. Use %x instead, and only use %p for host pointers. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
2015-11-24KVM: arm/arm64: vgic: Trust the LR state for HW IRQsChristoffer Dall1-14/+2
We were probing the physial distributor state for the active state of a HW virtual IRQ, because we had seen evidence that the LR state was not cleared when the guest deactivated a virtual interrupted. However, this issue turned out to be a software bug in the GIC, which was solved by: 84aab5e68c2a5e1e18d81ae8308c3ce25d501b29 (KVM: arm/arm64: arch_timer: Preserve physical dist. active state on LR.active, 2015-11-24) Therefore, get rid of the complexities and just look at the LR. Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>