linux-dev - Linux kernel development work

Age	Commit message	Author	Files	Lines
2016-03-13	tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump	Len Brown	1	-1/+1
	MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=0: unlimited) should print as MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=8: unlimited) Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: call __cpuid() instead of __get_cpuid()	Len Brown	1	-7/+7
	turbostat already checks whether calling each cpuid leaf is legal, and it doesn't look at the function return value, so call the simpler gcc intrinsic __cpuid() instead of __get_cpuid(). Syntax only, no functional change. Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: indicate SMX and SGX support	Len Brown	1	-1/+27
	SGX presence is related to a SKL power workaround, so let's show when that is enabled. Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: detect and work around syscall jitter	Len Brown	1	-1/+50
	The accuracy of Bzy_MHz and Busy% depends on reading the TSC, APERF, and MPERF close together in time. When there is a very short measurement interval, or a large system is profoundly idle, the changes in APERF and MPERF may be very small. They can be small enough that an expensive interrupt between reading APERF and MPERF can cause the APERF/MPERF ratio to become inaccurate, resulting in invalid calculation and display of Bzy_MHz. A dummy read of APERF makes this problem much more rare. Apparently this first system call after exiting a long stretch of idle is when we typically see expensive timer interrupts that cause large jitter. For the cases that the dummy APERF read fails to prevent, we compare the latency of the APERF and MPERF reads. If they differ by more than 2x, we re-issue them. Signed-off-by: Len Brown <len.brown@intel.com>
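The retry logic described above can be sketched in Python. This is an illustrative model only, not turbostat's C code: `read_tsc`, `read_aperf`, and `read_mperf` are hypothetical callables standing in for the real counter reads, and `max_tries` is an assumed bound that the commit message does not mention.

```python
def latencies_agree(aperf_latency, mperf_latency, factor=2):
    """Return True if neither read took more than `factor` times the other."""
    slow = max(aperf_latency, mperf_latency)
    fast = min(aperf_latency, mperf_latency)
    return slow <= factor * fast


def read_counters(read_tsc, read_aperf, read_mperf, max_tries=5):
    """Read APERF and MPERF back to back, retrying on suspected jitter.

    All three callables are hypothetical stand-ins for the counter reads.
    Returns (aperf, mperf), or None if every attempt looked jittery.
    """
    read_aperf()  # dummy read to absorb the first, often slow, call
    for _ in range(max_tries):
        t0 = read_tsc()
        aperf = read_aperf()
        t1 = read_tsc()
        mperf = read_mperf()
        t2 = read_tsc()
        # t1 - t0 is the APERF read latency, t2 - t1 the MPERF read latency.
        if latencies_agree(t1 - t0, t2 - t1):
            return aperf, mperf
    return None
```

Timestamping around each read is what lets the caller notice that an interrupt landed between the two reads, without needing to know anything about the interrupt itself.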
2016-03-13	tools/power turbostat: show GFX%rc6	Len Brown	1	-0/+45
	The column "GFX%rc6" shows the percentage of time the GPU is in the "render C6" state, rc6. Deep package C-states on several systems depend on the GPU being in RC6. This information comes from the counter /sys/class/drm/card0/power/rc6_residency_ms, as read before and after the measurement interval. Signed-off-by: Len Brown <len.brown@intel.com>
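As a worked example of the calculation implied above, the percentage can be derived from two readings of the residency counter. A minimal sketch, assuming the counter is in milliseconds (as its name suggests) and the interval is given in seconds; the function name is hypothetical.

```python
def rc6_percent(before_ms, after_ms, interval_sec):
    """Percentage of the interval spent in rc6, from two counter readings."""
    if interval_sec <= 0:
        raise ValueError("interval must be positive")
    # Residency gained during the interval, divided by the interval length
    # expressed in the same unit (milliseconds).
    return 100.0 * (after_ms - before_ms) / (interval_sec * 1000.0)
```

For instance, a counter that advances by 500 ms over a 1 second interval corresponds to 50% rc6 residency.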
2016-03-13	tools/power turbostat: show GFXMHz	Len Brown	1	-1/+49
	Under the column "GFXMHz", show a snapshot of this attribute: /sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz This is an instantaneous snapshot of what sysfs presents at the end of the measurement interval. turbostat does not average or otherwise perform any math on this value. Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: show IRQs per CPU	Len Brown	1	-4/+122
	The new IRQ column shows how many interrupts have occurred on each CPU during the measurement interval. This information comes from the difference between /proc/interrupts snapshots made before and after the measurement interval. The first row, the system summary, shows the sum of the IRQs for all CPUs during that interval. Signed-off-by: Len Brown <len.brown@intel.com>
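The snapshot-and-subtract approach can be illustrated with a short Python sketch. It is not turbostat's parser: the function names are hypothetical, and the parsing assumes the usual /proc/interrupts layout of a CPU header row followed by "label: count count ... description" rows.

```python
def irqs_per_cpu(snapshot):
    """Sum the per-CPU interrupt counts in one /proc/interrupts snapshot."""
    lines = snapshot.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        fields = line.split(":", 1)
        if len(fields) != 2:
            continue
        counts = fields[1].split()[:ncpus]
        # On a multi-CPU system, rows with a single global count (for
        # example ERR) have too few columns and are skipped here.
        if len(counts) < ncpus or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in enumerate(counts):
            totals[cpu] += int(count)
    return totals


def irq_delta(before, after):
    """Interrupts per CPU that occurred between two snapshots."""
    return [b - a for a, b in zip(irqs_per_cpu(before), irqs_per_cpu(after))]
```

Summing the list returned by `irq_delta` gives the figure that the system summary row would show.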
2016-03-13	tools/power turbostat: make fewer systems calls	Len Brown	1	-10/+41
	skip the open(2)/close(2) on each msr read by keeping the /dev/cpu/*/msr files open. The remaining read(2) is generally far fewer cycles than the removed open(2) system call. Signed-off-by: Len Brown <len.brown@intel.com>
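The idea of keeping descriptors open can be sketched as a small cache keyed by path. This is a Python illustration of the technique, not the turbostat implementation; the `FdCache` class and its `opens` counter are hypothetical names introduced here.

```python
import os


class FdCache:
    """Open each file once and reuse the descriptor for later reads."""

    def __init__(self):
        self._fds = {}
        self.opens = 0  # number of open(2) calls actually issued

    def read_at(self, path, offset, length=8):
        fd = self._fds.get(path)
        if fd is None:
            fd = os.open(path, os.O_RDONLY)
            self._fds[path] = fd
            self.opens += 1
        # pread(2) reads at an explicit offset, so no seek is needed.
        return os.pread(fd, length, offset)

    def close(self):
        for fd in self._fds.values():
            os.close(fd)
        self._fds.clear()
```

With a cache like this, repeated reads of the same file cost one read call each instead of an open, a read, and a close.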
2016-03-13	tools/power turbostat: fix compiler warnings	Len Brown	1	-4/+4
	Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: add --out option for saving output in a file	Len Brown	2	-135/+160
	By default: turbostat --debug configuration info goes to stderr. In FORK mode, turbostat statistics go to stderr. In PERIODIC mode, turbostat statistics go to stdout. These defaults do not change, but an option "--out file" will send all output above only to the specified file. Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: re-name "%Busy" field to "Busy%"	Len Brown	2	-11/+11
	some tools processing turbostat output have difficulty with items that begin with %... Reported-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding	Hubert Chrzaniuk	1	-27/+26
	Following changes have been made: - changed MSR_NHM_TURBO_RATIO_LIMIT to MSR_TURBO_RATIO_LIMIT in debug print for consistency with Developer Manual - updated definition of bitfields in MSR_TURBO_RATIO_LIMIT and appropriate parsing code - added x200 to list of architectures that do not support Nehalem compatible definition of MSR_TURBO_RATIO_LIMIT register (x200 has the register but bits definition is custom) - fixed typo in code that parses MSR_TURBO_RATIO_LIMIT (logical instead of bitwise operator) - changed MSR_TURBO_RATIO_LIMIT parsing algorithm so the print out had the same order as implementations for other platforms Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-13	tools/power turbostat: Intel Xeon x200: fix erroneous bclk value	Chrzaniuk, Hubert	1	-1/+1
	x200 does not enable any way to programmatically obtain bus clock speed. Bclk for the architecture has a fixed value of 100 MHz. At the same time x200 cannot be included in has_snb_msrs since it does not support C7 idle state. Prior to this patch, MHz values reported on this chip were erroneously calculated using bclk of 133MHz, causing MHz values to be reported 33% higher than actual. Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
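The 33% figure follows directly from the ratio of the two bus clocks. A small worked example, assuming the reported frequency is the product of a ratio and bclk, and taking 133.33 MHz as the precise value behind "133MHz" (an assumption); the function names are hypothetical.

```python
def core_mhz(ratio, bclk_mhz):
    """Frequency implied by a clock ratio and a bus clock."""
    return ratio * bclk_mhz


def overstatement_percent(assumed_bclk_mhz, actual_bclk_mhz):
    """How much higher, in percent, a value computed with the wrong bclk is."""
    return 100.0 * (assumed_bclk_mhz / actual_bclk_mhz - 1.0)
```

Because the ratio cancels out, the overstatement is the same for every reported MHz value: about 33% when 133.33 MHz is used in place of 100 MHz.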
2016-03-13	tools/power turbostat: allow sub-sec intervals	Len Brown	2	-5/+17
	turbostat -i interval_sec will sample and display statistics every interval_sec. interval_sec used to be a whole number of seconds, but now we accept a decimal, as small as 0.001 sec (1 ms). Signed-off-by: Len Brown <len.brown@intel.com>
2016-03-09	device property: fwnode->secondary may contain ERR_PTR(-ENODEV)	Heikki Krogerus	1	-4/+4
	This fixes BUG triggered when fwnode->secondary is not NULL, but has ERR_PTR(-ENODEV) instead. BUG: unable to handle kernel paging request at ffffffffffffffed IP: [<ffffffff81677b86>] __fwnode_property_read_string+0x26/0x160 PGD 200e067 PUD 2010067 PMD 0 Oops: 0000 [#1] SMP KASAN Modules linked in: dwc3_pci(+) dwc3 CPU: 0 PID: 1138 Comm: modprobe Not tainted 4.5.0-rc5+ #61 task: ffff88015aaf5b00 ti: ffff88007b958000 task.ti: ffff88007b958000 RIP: 0010:[<ffffffff81677b86>] [<ffffffff81677b86>] __fwnode_property_read_string+0x26/0x160 RSP: 0018:ffff88007b95eff8 EFLAGS: 00010246 RAX: fffffbfffffffffd RBX: ffffffffffffffed RCX: ffff88015999cd37 RDX: dffffc0000000000 RSI: ffffffff81e11bc0 RDI: ffffffffffffffed RBP: ffff88007b95f020 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88007b90f7cf R11: 0000000000000000 R12: ffff88007b95f0a0 R13: 00000000fffffffa R14: ffffffff81e11bc0 R15: ffff880159ea37a0 FS: 00007ff35f46c700(0000) GS:ffff88015b800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffffffffffffed CR3: 000000007b8be000 CR4: 00000000001006f0 Stack: ffff88015999cd20 ffffffff81e11bc0 ffff88007b95f0a0 ffff88007b383dd8 ffff880159ea37a0 ffff88007b95f048 ffffffff81677d03 ffff88007b952460 ffffffff81e11bc0 ffff88007b95f0a0 ffff88007b95f070 ffffffff81677d40 Call Trace: [<ffffffff81677d03>] fwnode_property_read_string+0x43/0x50 [<ffffffff81677d40>] device_property_read_string+0x30/0x40 ... Fixes: 362c0b30249e (device property: Fallback to secondary fwnode if primary misses the property) Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-08	ACPICA: Revert "Parser: Fix for SuperName method invocation"	Bob Moore	1	-5/+4
	ACPICA commit eade8f78f2aa21e8eabc3380a5728db47273bcf1 Revert commit ae90fbf562d7 (ACPICA: Parser: Fix for SuperName method invocation). Support for method invocations as part of super_name will be removed from the ACPI specification, since no AML interpreter supports it. Fixes: ae90fbf562d7 (ACPICA: Parser: Fix for SuperName method invocation) Link: https://github.com/acpica/acpica/commit/eade8f78 Signed-off-by: Bob Moore <robert.moore@intel.com> Signed-off-by: Lv Zheng <lv.zheng@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-03-06	Linux 4.5-rc7	Linus Torvalds	1	-1/+1

2016-03-05	um: use %lx format specifiers for unsigned longs	Colin Ian King	1	-2/+2
	static analysis from cppcheck detected %x being used for unsigned longs: [arch/x86/um/os-Linux/task_size.c:112]: (warning) %x in format string (no. 1) requires 'unsigned int' but the argument type is 'unsigned long'. Use %lx instead of %x Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2016-03-05	um: Export pm_power_off	Richard Weinberger	1	-0/+1
	...modules are using this symbol. Export it like all other archs do. Signed-off-by: Richard Weinberger <richard@nod.at>
2016-03-05	Revert "um: Fix get_signal() usage"	Richard Weinberger	1	-1/+1
	Commit db2f24dc240856fb1d78005307f1523b7b3c121b was plain wrong. I did not realize that we are allowed to loop here. In fact we have to loop and must not return to userspace before all SIGSEGVs have been delivered. Other archs do this directly in their entry code, UML does it here. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2016-03-05	ubi: Fix out of bounds write in volume update code	Richard Weinberger	1	-1/+1
	ubi_start_leb_change() allocates too few bytes. ubi_more_leb_change_data() will write up to req->upd_bytes + ubi->min_io_size bytes. Cc: stable@vger.kernel.org Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Boris Brezillon <boris.brezillon@free-electrons.com>
2016-03-04	nfit: Continue init even if ARS commands are unimplemented	Vishal Verma	1	-4/+11
	If firmware doesn't implement any of the ARS commands, take that to mean that ARS is unsupported, and continue to initialize regions without bad block lists. We cannot make the assumption that ARS commands will be unconditionally supported on all NVDIMMs. Reported-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Tested-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-03-04	ARM: 8544/1: set_memory_xx fixes	Mika Penttilä	1	-0/+3
	Allow zero size updates. This makes set_memory_xx() consistent with x86, s390 and arm64 and makes apply_to_page_range() not BUG() when loading modules. Signed-off-by: Mika Penttilä <mika.penttila@nextfour.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2016-03-04	MIPS: traps: Fix SIGFPE information leak from `do_ov' and `do_trap_or_bp'	Maciej W. Rozycki	1	-7/+6
	Avoid sending a partially initialised `siginfo_t' structure along SIGFPE signals issued from `do_ov' and `do_trap_or_bp', leading to information leaking from the kernel stack. Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Cc: stable@vger.kernel.org Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2016-03-04	ceph: initial CEPH_FEATURE_FS_FILE_LAYOUT_V2 support	Yan, Zheng	7	-3/+49
	Add support for the format change of MClientReply/MclientCaps. Also add code that denies access to inodes with pool_ns layouts. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-03-04	gpu: host1x: Set DMA ops on device creation	Alexandre Courbot	1	-0/+2
	Currently host1x-instantiated devices have their dma_ops left to NULL, which makes any DMA operation (like buffer import) on ARM64 fall back to the dummy_dma_ops and fail with an error. This patch calls of_dma_configure() with the host1x node when creating such a device, so the proper DMA operations are set. Suggested-by: Thierry Reding <thierry.reding@gmail.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com>
2016-03-04	gpu: host1x: Set DMA mask	Alexandre Courbot	2	-0/+8
	The default DMA mask covers a 32-bit address range, but host1x devices can address a larger range on TK1 and TX1. Set the DMA mask to the range addressable when we use the IOMMU to prevent the use of bounce buffers. Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com>
2016-03-04	tracing: Do not have 'comm' filter override event 'comm' field	Steven Rostedt (Red Hat)	3	-12/+17
	Commit 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names" added a 'comm' filter that will filter events based on the current tasks struct 'comm'. But this now hides the ability to filter events that have a 'comm' field too. For example, sched_migrate_task trace event. That has a 'comm' field of the task to be migrated. echo 'comm == "bash"' > events/sched_migrate_task/filter will now filter all sched_migrate_task events for tasks named "bash" that migrates other tasks (in interrupt context), instead of seeing when "bash" itself gets migrated. This fix requires a couple of changes. 1) Change the look up order for filter predicates to look at the events fields before looking at the generic filters. 2) Instead of basing the filter function off of the "comm" name, have the generic "comm" filter have its own filter_type (FILTER_COMM). Test against the type instead of the name to assign the filter function. 3) Add a new "COMM" filter that works just like "comm" but will filter based on the current task, even if the trace event contains a "comm" field. Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU". Cc: stable@vger.kernel.org # v4.3+ Fixes: 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names" Reported-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
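The lookup order described in the numbered changes can be modelled in a few lines of Python. This is a conceptual sketch rather than the kernel's filter code: `GENERIC_FILTERS` and `resolve_predicate` are hypothetical names, and the event field sets used with it are illustrative.

```python
# Generic filters and the filter type each one maps to. The upper-case
# names always refer to the current task or CPU.
GENERIC_FILTERS = {
    "comm": "FILTER_COMM",
    "cpu": "FILTER_CPU",
    "COMM": "FILTER_COMM",
    "CPU": "FILTER_CPU",
}


def resolve_predicate(event_fields, name):
    """Resolve a filter name, preferring the event's own fields."""
    if name in event_fields:
        # The event defines this field, so it wins over the generic filter.
        return ("event", name)
    if name in GENERIC_FILTERS:
        return ("generic", GENERIC_FILTERS[name])
    raise KeyError(name)
```

With this ordering, "comm" on an event that has its own comm field filters on that field, while "COMM" still reaches the current task's name.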
2016-03-04	ALSA: hda - hdmi defer to register acomp eld notifier	Libin Yang	1	-12/+12
	Defer to register acomp eld notifier until hdmi audio driver is fully ready. After registering eld notifier, gfx driver can use this callback function to notify audio driver the monitor connection event. However this action may happen when audio driver is adding the pins or doing other initialization. This is not always safe, however. For example, using per_pin->lock before the lock is initialized. Let's register the eld notifier after the initialization is done. Signed-off-by: Libin Yang <libin.yang@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-03-04	ALSA: hda - hdmi add wmb barrier for audio component	Libin Yang	1	-0/+5
	To make sure audio_ptr is set before intel_audio_codec_enable() or intel_audio_codec_disable() calling pin_eld_notify(), this patch adds wmb barrier to prevent optimizing. Signed-off-by: Libin Yang <libin.yang@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-03-03	powerpc/fsl-book3e: Avoid lbarx on e5500	Scott Wood	1	-0/+13
	lbarx/stbcx. are implemented on e6500, but not on e5500. Likewise, SMT is on e6500, but not on e5500. So, avoid executing an unimplemented instruction by only locking when needed (i.e. in the presence of SMT). Signed-off-by: Scott Wood <oss@buserror.net>
2016-03-03	Btrfs: fix loading of orphan roots leading to BUG_ON	Filipe Manana	1	-1/+9
	When looking for orphan roots during mount we can end up hitting a BUG_ON() (at root-item.c:btrfs_find_orphan_roots()) if a log tree is replayed and qgroups are enabled. This is because after a log tree is replayed, a transaction commit is made, which triggers qgroup extent accounting which in turn does backref walking which ends up reading and inserting all roots in the radix tree fs_info->fs_root_radix, including orphan roots (deleted snapshots). So after the log tree is replayed, when finding orphan roots we hit the BUG_ON with the following trace: [118209.182438] ------------[ cut here ]------------ [118209.183279] kernel BUG at fs/btrfs/root-tree.c:314! [118209.184074] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [118209.185123] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic ppdev xor raid6_pq evdev sg parport_pc parport acpi_cpufreq tpm_tis tpm psmouse processor i2c_piix4 serio_raw pcspkr i2c_core button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio scsi_mod e1000 floppy [last unloaded: btrfs] [118209.186318] CPU: 14 PID: 28428 Comm: mount Tainted: G W 4.5.0-rc5-btrfs-next-24+ #1 [118209.186318] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [118209.186318] task: ffff8801ec131040 ti: ffff8800af34c000 task.ti: ffff8800af34c000 [118209.186318] RIP: 0010:[<ffffffffa04237d7>] [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs] [118209.186318] RSP: 0018:ffff8800af34faa8 EFLAGS: 00010246 [118209.186318] RAX: 00000000ffffffef RBX: 00000000ffffffef RCX: 0000000000000001 [118209.186318] RDX: 0000000080000000 RSI: 0000000000000001 RDI: 00000000ffffffff [118209.186318] RBP: ffff8800af34fb08 R08: 0000000000000001 R09: 0000000000000000 [118209.186318] R10: ffff8800af34f9f0 R11: 6db6db6db6db6db7 R12: ffff880171b97000 [118209.186318] R13: ffff8801ca9d65e0 R14: ffff8800afa2e000 R15: 0000160000000000 [118209.186318] FS: 
00007f5bcb914840(0000) GS:ffff88023edc0000(0000) knlGS:0000000000000000 [118209.186318] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [118209.186318] CR2: 00007f5bcaceb5d9 CR3: 00000000b49b5000 CR4: 00000000000006e0 [118209.186318] Stack: [118209.186318] fffffbffffffffff 010230ffffffffff 0101000000000000 ff84000000000000 [118209.186318] fbffffffffffffff 30ffffffffffffff 0000000000000101 ffff880082348000 [118209.186318] 0000000000000000 ffff8800afa2e000 ffff8800afa2e000 0000000000000000 [118209.186318] Call Trace: [118209.186318] [<ffffffffa042e2db>] open_ctree+0x1e37/0x21b9 [btrfs] [118209.186318] [<ffffffffa040a753>] btrfs_mount+0x97e/0xaed [btrfs] [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131 [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde [118209.186318] [<ffffffffa0409f81>] btrfs_mount+0x1ac/0xaed [btrfs] [118209.186318] [<ffffffff8108e1c0>] ? trace_hardirqs_on+0xd/0xf [118209.186318] [<ffffffff8108c26b>] ? lockdep_init_map+0xb9/0x1b3 [118209.186318] [<ffffffff8117b87e>] mount_fs+0x67/0x131 [118209.186318] [<ffffffff81192d2b>] vfs_kern_mount+0x6c/0xde [118209.186318] [<ffffffff81195637>] do_mount+0x8a6/0x9e8 [118209.186318] [<ffffffff8119598d>] SyS_mount+0x77/0x9f [118209.186318] [<ffffffff81493017>] entry_SYSCALL_64_fastpath+0x12/0x6b [118209.186318] Code: 64 00 00 85 c0 89 c3 75 24 f0 41 80 4c 24 20 20 49 8b bc 24 f0 01 00 00 4c 89 e6 e8 e8 65 00 00 85 c0 89 c3 74 11 83 f8 ef 75 02 <0f> 0b 4c 89 e7 e8 da 72 00 00 eb 1c 41 83 bc 24 00 01 00 00 00 [118209.186318] RIP [<ffffffffa04237d7>] btrfs_find_orphan_roots+0x1fc/0x244 [btrfs] [118209.186318] RSP <ffff8800af34faa8> [118209.230735] ---[ end trace 83938f987d85d477 ]--- So fix this by not treating the error -EEXIST, returned when attempting to insert a root already inserted by the backref walking code, as an error. 
The following test case for xfstests reproduces the bug: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey _run_btrfs_util_prog quota enable $SCRATCH_MNT # Create 2 directories with one file in one of them. # We use these just to trigger a transaction commit later, moving the file from # directory a to directory b and doing an fsync against directory a. mkdir $SCRATCH_MNT/a mkdir $SCRATCH_MNT/b touch $SCRATCH_MNT/a/f sync # Create our test file with 2 4K extents. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io # Create a snapshot and delete it. This doesn't really delete the snapshot # immediately, just makes it inaccessible and invisible to user space, the # snapshot is deleted later by a dedicated kernel thread (cleaner kthread) # which is woken up at the next transaction commit. # A root orphan item is inserted into the tree of tree roots, so that if a # power failure happens before the dedicated kernel thread does the snapshot # deletion, the next time the filesystem is mounted it resumes the snapshot # deletion. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap # Now overwrite half of the extents we wrote before. Because we made a snapshot # before, which isn't really deleted yet (since no transaction commit happened # after we did the snapshot delete request), the non overwritten extents get # referenced twice, once by the default subvolume and once by the snapshot. 
$XFS_IO_PROG -c "pwrite -S 0xbb 4K 8K" $SCRATCH_MNT/foobar | _filter_xfs_io # Now move file f from directory a to directory b and fsync directory a. # The fsync on the directory a triggers a transaction commit (because a file # was moved from it to another directory) and the file fsync leaves a log tree # with file extent items to replay. mv $SCRATCH_MNT/a/f $SCRATCH_MNT/a/b $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/a $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar echo "File digest before power failure:" md5sum $SCRATCH_MNT/foobar | _filter_scratch # Now simulate a power failure and mount the filesystem to replay the log tree. # After the log tree was replayed, we used to hit a BUG_ON() when processing # the root orphan item for the deleted snapshot. This is because when processing # an orphan root the code expected to be the first code inserting the root into # the fs_info->fs_root_radix radix tree, while in reality it was the second # caller attempting to do it - the first caller was the transaction commit that # took place after replaying the log tree, when updating the qgroup counters. _flakey_drop_and_remount echo "File digest before after failure:" # Must match what we got before the power failure. md5sum $SCRATCH_MNT/foobar | _filter_scratch _unmount_flakey status=0 exit Fixes: 2d9e97761087 ("Btrfs: use btrfs_get_fs_root in resolve_indirect_ref") Cc: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-03	block: support large requests in blk_rq_map_user_iov	Christoph Hellwig	1	-30/+61
	This patch adds support for larger requests in blk_rq_map_user_iov by allowing it to build multiple bios for a request. This functionality used to exist for the non-vectored blk_rq_map_user in the past, and this patch reuses the existing functionality for it on the unmap side, which stuck around. Thanks to the iov_iter API supporting multiple bios is fairly trivial, as we can just iterate the iov until we've consumed the whole iov_iter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	block: fix blk_rq_get_max_sectors for driver private requests	Christoph Hellwig	1	-1/+1
	Driver private request types should not get the artificial cap for the FS requests. This is important to use the full device capabilities for internal command or NVMe pass through commands. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Updated by me to use an explicit check for the one command type that does support extended checking, instead of relying on the ordering of the enum command values - as suggested by Keith. Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	nvme: fix max_segments integer truncation	Christoph Hellwig	1	-2/+4
	The block layer uses an unsigned short for max_segments. The way we calculate the value for NVMe tends to generate very large 32-bit values, which after integer truncation may lead to a zero value instead of the desired outcome. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
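The truncation hazard is easy to reproduce with plain integer arithmetic. The sketch below models a C assignment to an unsigned short in Python; clamping is shown as one way to avoid the wraparound and is an assumption here, since the commit message does not spell out the exact remedy. Both function names are hypothetical.

```python
USHRT_MAX = 0xFFFF  # largest value a 16-bit unsigned short can hold


def as_unsigned_short(value):
    """Model assigning a wider unsigned value to an unsigned short in C."""
    return value & USHRT_MAX  # the high bits are silently discarded


def clamp_to_unsigned_short(value):
    """Saturate at USHRT_MAX instead of wrapping around."""
    return min(value, USHRT_MAX)
```

Any multiple of 65536 wraps to zero under truncation, which is the failure mode described above, while clamping preserves a large but valid limit.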
2016-03-03	nvme: set queue limits for the admin queue	Christoph Hellwig	1	-10/+19
	Factor out a helper to set all the device specific queue limits and apply them to the admin queue in addition to the I/O queues. Without this the command size on the admin queue is arbitrarily low, and the missing other limitations are just minefields waiting for victims. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Jeff Lien <Jeff.Lien@hgst.com> Tested-by: Jeff Lien <Jeff.Lien@hgst.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	writeback: flush inode cgroup wb switches instead of pinning super_block	Tejun Heo	3	-13/+47
	If cgroup writeback is in use, inodes can be scheduled for asynchronous wb switching. Before 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches"), this could race with umount leading to super_block being destroyed while inodes are pinned for wb switching. 5ff8eaac1636 fixed it by bumping s_active while wb switches are in flight; however, this allowed in-flight wb switches to make umounts asynchronous when the userland expected synchronicity - e.g. fsck immediately following umount may fail because the device is still busy. This patch removes the problematic super_block pinning and instead makes generic_shutdown_super() flush in-flight wb switches. wb switches are now executed on a dedicated isw_wq so that they can be flushed and isw_nr_in_flight keeps track of the number of in-flight wb switches so that flushing can be avoided in most cases. v2: Move cgroup_writeback_umount() further below and add MS_ACTIVE check in inode_switch_wbs() as Jan and Al suggested. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Tahsin Erdogan <tahsin@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@ZenIV.linux.org.uk> Link: http://lkml.kernel.org/g/CAAeU0aNCq7LGODvVGRU-oU_o-6enii5ey0p1c26D1ZzYwkDc5A@mail.gmail.com Fixes: 5ff8eaac1636 ("writeback: keep superblock pinned during cgroup writeback association switches") Cc: stable@vger.kernel.org #v4.5 Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Fix 0-length integrity payload	Keith Busch	1	-1/+1
	A user could send a passthrough IO command with a metadata pointer to a namespace without metadata. With metadata length of 0, kmalloc returns ZERO_SIZE_PTR. Since that is not NULL, the driver would have set this as the bio's integrity payload, which causes an access fault on completion. This patch ignores the user's metadata buffer if the namespace format does not support separate metadata. Reported-by: Stephen Bates <stephen.bates@microsemi.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Don't allow unsupported flags	Keith Busch	1	-0/+4
	The command flags can change the meaning of other fields in the command that the driver is not prepared to handle. Specifically, the user could passthrough an SGL flag, causing the controller to misinterpret the PRP list the driver created, potentially corrupting memory or data. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Move error handling to failed reset handler	Keith Busch	3	-18/+50
	This moves failed queue handling out of the namespace removal path and into the reset failure path, fixing a hanging condition if the controller fails or the link goes down during del_gendisk. Previously the driver had to see the controller as degraded prior to calling del_gendisk to set up the queues to fail. But, if the controller happened to fail after this, there was no task to end outstanding requests. On failure, all namespace states are set to dead. This causes capacity to revalidate to 0 and ends all new requests with error status. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Simplify device reset failure	Keith Busch	1	-27/+21
	A reset failure schedules the device to unbind from the driver through the pci driver's remove. This cleans up all initialization, so there is no need to duplicate the potentially racy cleanup. To help understand why a reset failed, the status is logged with the existing warning message. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Fix namespace removal deadlock	Keith Busch	3	-7/+26
	This patch makes nvme namespace removal lockless. It is up to the caller to ensure no active namespace scanning is occurring. To ensure no scan work occurs, the nvme pci driver adds a removing state to the controller device to avoid queueing scan work during removal. The work is flushed after setting the state, so no new scan work can be queued. The lockless removal allows the driver to clean up a namespace request_queue if the controller fails during removal. Previously this could deadlock trying to acquire the namespace mutex in order to handle such events. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Use IDA for namespace disk naming	Keith Busch	2	-3/+14
	A namespace may be detached from a controller, but a user may be holding a reference to it. Attaching a new namespace with the same NSID will create duplicate names when using the NSID to name the disk. This patch uses an IDA that is released only when the last reference is released instead of using the namespace ID. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	NVMe: Don't unmap controller registers on reset	Keith Busch	1	-29/+42
	Unmapping the registers on reset or shutdown is not necessary. Keeping the mapping simplifies reset handling. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	block: merge: get the 1st and last bvec via helpers	Ming Lei	1	-6/+2
	This patch applies the two introduced helpers to figure out the 1st and last bvec. Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	block: get the 1st and last bvec via helpers	Ming Lei	1	-4/+9
	This patch applies the two introduced helpers to figure out the 1st and last bvec, and fixes the original way after bio splitting. Cc: stable@vger.kernel.org Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	block: check virt boundary in bio_will_gap()	Ming Lei	1	-5/+11
	In the following patch, the way of figuring out the last bvec will be changed at a small cost, so return immediately if the queue doesn't have a virt boundary limit. In fact, most devices do not have this limit. Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	block: bio: introduce helpers to get the 1st and last bvec	Ming Lei	1	-0/+37
	The bio passed to bio_will_gap() may be fast cloned from upper layer(dm, md, bcache, fs, ...), or from bio splitting in block core. Unfortunately bio_will_gap() just figures out the last bvec via 'bi_io_vec[prev->bi_vcnt - 1]' directly, and this way is obviously wrong. This patch introduces two helpers for getting the first and last bvec of one bio for fixing the issue. Cc: stable@vger.kernel.org Reported-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2016-03-03	[media] media: Sanitise the reserved fields of the G_TOPOLOGY IOCTL arguments	Sakari Ailus	1	-9/+9
	The argument structs are used in arrays for G_TOPOLOGY IOCTL. The arguments themselves do not need to be aligned to a power of two, but aligning them up to the largest basic type alignment (u64) on common ABIs is a good thing to do. The patch changes the size of the reserved fields to 5 or 6 u32's and aligns the size of the struct to 8 bytes so we no longer depend on the compiler to perform the alignment. While at it, add __attribute__ ((packed)) to these structs as well. Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
2016-03-03	[media] media.h: postpone connectors entities	Mauro Carvalho Chehab	1	-0/+8
	The representation of external connections got some heated discussions recently. As we're too close to the merge window, let's not set those entities in stone. Signed-off-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com>