aboutsummaryrefslogtreecommitdiffstats
path: root/tools/perf/scripts/python/export-to-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2016-03-01Btrfs: do not collect ordered extents when logging that inode existsFilipe Manana1-1/+16
When logging that an inode exists, for example as part of a directory fsync operation, we were collecting any ordered extents for the inode but we ended up doing nothing with them except tagging them as processed, by setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a subsequent fsync of that inode (using the LOG_INODE_ALL mode) from collecting and processing them. This created a time window where a second fsync against the inode, using the fast path, ended up not logging the checksums for the new extents but it logged the extents since they were part of the list of modified extents. This happened because the ordered extents were not collected and checksums were not yet added to the csum tree - the ordered extents have not gone through btrfs_finish_ordered_io() yet (which is where we add them to the csum tree by calling inode.c:add_pending_csums()). So fix this by not collecting an inode's ordered extents if we are logging it with the LOG_INODE_EXISTS mode. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01Btrfs: fix race when checking if we can skip fsync'ing an inodeFilipe Manana1-4/+5
If we're about to do a fast fsync for an inode and btrfs_inode_in_log() returns false, it's possible that we had an ordered extent in progress (btrfs_finish_ordered_io() not run yet) when we noticed that the inode's last_trans field was not greater than the id of the last committed transaction, but shortly after, before we checked if there were any ongoing ordered extents, the ordered extent had just completed and removed itself from the inode's ordered tree, in which case we end up not logging the inode, losing some data if a power failure or crash happens after the fsync handler returns and before the transaction is committed. Fix this by checking first if there are any ongoing ordered extents before comparing the inode's last_trans with the id of the last committed transaction - when it completes, an ordered extent always updates the inode's last_trans before it removes itself from the inode's ordered tree (at btrfs_finish_ordered_io()). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01Btrfs: fix listxattrs not listing all xattrs packed in the same itemFilipe Manana1-24/+41
In the listxattrs handler, we were not listing all the xattrs that are packed in the same btree item, which happens when multiple xattrs have a name that when crc32c hashed produce the same checksum value. Fix this by processing them all. The following test case for xfstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/attr # real QA test starts here _supported_fs generic _supported_os Linux _require_scratch _require_attrs rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test file with a few xattrs. The first 3 xattrs have a name # that when given as input to a crc32c function result in the same checksum. # This made btrfs list only one of the xattrs through listxattrs system call # (because it packs xattrs with the same name checksum into the same btree # item). touch $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.foobar -v 123 $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.WvG1c1Td -v qwerty $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.J3__T_Km3dVsW_ -v hello $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.something -v pizza $SCRATCH_MNT/testfile $SETFATTR_PROG -n user.ping -v pong $SCRATCH_MNT/testfile # Now call getfattr with --dump, which calls the listxattrs system call. # It should list all the xattrs we have set before. $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/testfile | _filter_scratch status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01Btrfs: fix deadlock between direct IO reads and buffered writesFilipe Manana1-2/+23
While running a test with a mix of buffered IO and direct IO against the same files I hit a deadlock reported by the following trace: [11642.140352] INFO: task kworker/u32:3:15282 blocked for more than 120 seconds. [11642.142452] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.143982] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.146332] kworker/u32:3 D ffff880230ef7988 [11642.147737] systemd-journald[571]: Sent WATCHDOG=1 notification. [11642.149771] 0 15282 2 0x00000000 [11642.151205] Workqueue: btrfs-flush_delalloc btrfs_flush_delalloc_helper [btrfs] [11642.154074] ffff880230ef7988 0000000000000246 0000000000014ec0 ffff88023ec94ec0 [11642.156722] ffff880233fe8f80 ffff880230ef8000 ffff88023ec94ec0 7fffffffffffffff [11642.159205] 0000000000000002 ffffffff8147b7f9 ffff880230ef79a0 ffffffff8147b541 [11642.161403] Call Trace: [11642.162129] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.163396] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.164871] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.167020] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.167931] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.182320] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf [11642.183762] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33 [11642.185308] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52 [11642.186782] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102 [11642.188217] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102 [11642.189626] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39 [11642.190803] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90 [11642.192158] [<ffffffff8111829f>] __lock_page+0x66/0x68 [11642.193379] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a [11642.194831] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs] [11642.197068] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs] [11642.199188] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs] [11642.200723] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [11642.202465] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs] [11642.203836] [<ffffffff811236bc>] do_writepages+0x23/0x2c [11642.205624] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61 [11642.207057] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15 [11642.208529] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs] [11642.210375] [<ffffffffa0462613>] ? btrfs_scrubparity_helper+0x140/0x33a [btrfs] [11642.212132] [<ffffffffa044f974>] btrfs_run_ordered_extent_work+0x25/0x34 [btrfs] [11642.213837] [<ffffffffa046262f>] btrfs_scrubparity_helper+0x15c/0x33a [btrfs] [11642.215457] [<ffffffffa046293b>] btrfs_flush_delalloc_helper+0xe/0x10 [btrfs] [11642.217095] [<ffffffff8106483e>] process_one_work+0x256/0x48b [11642.218324] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7 [11642.219466] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289 [11642.220801] [<ffffffff8106a500>] kthread+0xd4/0xdc [11642.222032] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.223190] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70 [11642.224394] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.226295] 2 locks held by kworker/u32:3/15282: [11642.227273] #0: ("%s-%s""btrfs", name){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.229412] #1: ((&work->normal_work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.231414] INFO: task kworker/u32:8:15289 blocked for more than 120 seconds. [11642.232872] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.234109] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.235776] kworker/u32:8 D ffff88020de5f848 0 15289 2 0x00000000 [11642.237412] Workqueue: writeback wb_workfn (flush-btrfs-481) [11642.238670] ffff88020de5f848 0000000000000246 0000000000014ec0 ffff88023ed54ec0 [11642.240475] ffff88021b1ece40 ffff88020de60000 ffff88023ed54ec0 7fffffffffffffff [11642.242154] 0000000000000002 ffffffff8147b7f9 ffff88020de5f860 ffffffff8147b541 [11642.243715] Call Trace: [11642.244390] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.245432] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.246392] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.247479] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.248551] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.249968] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf [11642.251043] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33 [11642.252202] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52 [11642.253210] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102 [11642.254307] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102 [11642.256118] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39 [11642.257131] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90 [11642.258200] [<ffffffff8111829f>] __lock_page+0x66/0x68 [11642.259168] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a [11642.260516] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs] [11642.261841] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs] [11642.263531] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs] [11642.264747] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [11642.266148] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs] [11642.267264] [<ffffffff811236bc>] do_writepages+0x23/0x2c [11642.268280] [<ffffffff81192a2b>] __writeback_single_inode+0xda/0x5ba [11642.269407] [<ffffffff811939f0>] writeback_sb_inodes+0x27b/0x43d [11642.270476] [<ffffffff81193c28>] __writeback_inodes_wb+0x76/0xae [11642.271547] [<ffffffff81193ea6>] wb_writeback+0x19e/0x41c [11642.272588] [<ffffffff81194821>] wb_workfn+0x201/0x341 [11642.273523] [<ffffffff81194821>] ? wb_workfn+0x201/0x341 [11642.274479] [<ffffffff8106483e>] process_one_work+0x256/0x48b [11642.275497] [<ffffffff81064f20>] worker_thread+0x1f5/0x2a7 [11642.276518] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289 [11642.277520] [<ffffffff81064d2b>] ? rescuer_thread+0x289/0x289 [11642.278517] [<ffffffff8106a500>] kthread+0xd4/0xdc [11642.279371] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.280468] [<ffffffff8147fdef>] ret_from_fork+0x3f/0x70 [11642.281607] [<ffffffff8106a42c>] ? kthread_parkme+0x24/0x24 [11642.282604] 3 locks held by kworker/u32:8/15289: [11642.283423] #0: ("writeback"){++++.+}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.285629] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8106474d>] process_one_work+0x165/0x48b [11642.287538] #2: (&type->s_umount_key#37){+++++.}, at: [<ffffffff81171217>] trylock_super+0x1b/0x4b [11642.289423] INFO: task fdm-stress:26848 blocked for more than 120 seconds. [11642.290547] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.291453] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.292864] fdm-stress D ffff88022c107c20 0 26848 26591 0x00000000 [11642.294118] ffff88022c107c20 000000038108affa 0000000000014ec0 ffff88023ed54ec0 [11642.295602] ffff88013ab1ca40 ffff88022c108000 ffff8800b2fc19d0 00000000000e0fff [11642.297098] ffff8800b2fc19b0 ffff88022c107c88 ffff88022c107c38 ffffffff8147b541 [11642.298433] Call Trace: [11642.298896] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.299738] [<ffffffffa045225d>] lock_extent_bits+0xfe/0x1a3 [btrfs] [11642.300833] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44 [11642.301943] [<ffffffffa0447516>] lock_and_cleanup_extent_if_need+0x68/0x18e [btrfs] [11642.303270] [<ffffffffa04485ba>] __btrfs_buffered_write+0x238/0x4c1 [btrfs] [11642.304552] [<ffffffffa044b50a>] ? btrfs_file_write_iter+0x17c/0x408 [btrfs] [11642.305782] [<ffffffffa044b682>] btrfs_file_write_iter+0x2f4/0x408 [btrfs] [11642.306878] [<ffffffff8116e298>] __vfs_write+0x7c/0xa5 [11642.307729] [<ffffffff8116e7d1>] vfs_write+0x9d/0xe8 [11642.308602] [<ffffffff8116efbb>] SyS_write+0x50/0x7e [11642.309410] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.310403] 3 locks held by fdm-stress/26848: [11642.311108] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40 [11642.312578] #1: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0 [11642.314170] #2: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffffa044b401>] btrfs_file_write_iter+0x73/0x408 [btrfs] [11642.316796] INFO: task fdm-stress:26849 blocked for more than 120 seconds. [11642.317842] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.318691] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.319959] fdm-stress D ffff8801964ffa68 0 26849 26591 0x00000000 [11642.321312] ffff8801964ffa68 00ff8801e9975f80 0000000000014ec0 ffff88023ed94ec0 [11642.322555] ffff8800b00b4840 ffff880196500000 ffff8801e9975f20 0000000000000002 [11642.323715] ffff8801e9975f18 ffff8800b00b4840 ffff8801964ffa80 ffffffff8147b541 [11642.325096] Call Trace: [11642.325532] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.326303] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.327180] [<ffffffff8108ae40>] ? mark_held_locks+0x5e/0x74 [11642.328114] [<ffffffff8147f30e>] ? _raw_spin_unlock_irq+0x2c/0x4a [11642.329051] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.330053] [<ffffffff8147bceb>] __wait_for_common+0x109/0x147 [11642.330952] [<ffffffff8147bceb>] ? __wait_for_common+0x109/0x147 [11642.331869] [<ffffffff8147e7bb>] ? usleep_range+0x4a/0x4a [11642.332925] [<ffffffff81074075>] ? wake_up_q+0x47/0x47 [11642.333736] [<ffffffff8147bd4d>] wait_for_completion+0x24/0x26 [11642.334672] [<ffffffffa044f5ce>] btrfs_wait_ordered_extents+0x1c8/0x217 [btrfs] [11642.335858] [<ffffffffa0465b5a>] btrfs_mksubvol+0x224/0x45d [btrfs] [11642.336854] [<ffffffff81082eef>] ? add_wait_queue_exclusive+0x44/0x44 [11642.337820] [<ffffffffa0465edb>] btrfs_ioctl_snap_create_transid+0x148/0x17a [btrfs] [11642.339026] [<ffffffffa046603b>] btrfs_ioctl_snap_create_v2+0xc7/0x110 [btrfs] [11642.340214] [<ffffffffa0468582>] btrfs_ioctl+0x590/0x27bd [btrfs] [11642.341123] [<ffffffff8147dc00>] ? mutex_unlock+0xe/0x10 [11642.341934] [<ffffffffa00fa6e9>] ? ext4_file_write_iter+0x2a3/0x36f [ext4] [11642.342936] [<ffffffff8108895d>] ? __lock_is_held+0x3c/0x57 [11642.343772] [<ffffffff81186a1d>] ? rcu_read_unlock+0x3e/0x5d [11642.344673] [<ffffffff8117dc95>] do_vfs_ioctl+0x458/0x4dc [11642.346024] [<ffffffff81186bbe>] ? __fget_light+0x62/0x71 [11642.346873] [<ffffffff8117dd70>] SyS_ioctl+0x57/0x79 [11642.347720] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.350222] 4 locks held by fdm-stress/26849: [11642.350898] #0: (sb_writers#11){.+.+.+}, at: [<ffffffff811706ee>] __sb_start_write+0x5f/0xb0 [11642.352375] #1: (&type->i_mutex_dir_key#4/1){+.+.+.}, at: [<ffffffffa0465981>] btrfs_mksubvol+0x4b/0x45d [btrfs] [11642.354072] #2: (&fs_info->subvol_sem){++++..}, at: [<ffffffffa0465a2a>] btrfs_mksubvol+0xf4/0x45d [btrfs] [11642.355647] #3: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.357516] INFO: task fdm-stress:26850 blocked for more than 120 seconds. [11642.358508] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.359376] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.368625] fdm-stress D ffff88021f167688 0 26850 26591 0x00000000 [11642.369716] ffff88021f167688 0000000000000001 0000000000014ec0 ffff88023edd4ec0 [11642.370950] ffff880128a98680 ffff88021f168000 ffff88023edd4ec0 7fffffffffffffff [11642.372210] 0000000000000002 ffffffff8147b7f9 ffff88021f1676a0 ffffffff8147b541 [11642.373430] Call Trace: [11642.373853] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.374623] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.375948] [<ffffffff8147e7fe>] schedule_timeout+0x43/0x109 [11642.376862] [<ffffffff8147b7f9>] ? bit_wait+0x2f/0x2f [11642.377637] [<ffffffff8108afd1>] ? trace_hardirqs_on_caller+0x17b/0x197 [11642.378610] [<ffffffff8108affa>] ? trace_hardirqs_on+0xd/0xf [11642.379457] [<ffffffff810b079b>] ? timekeeping_get_ns+0xe/0x33 [11642.380366] [<ffffffff810b0f61>] ? ktime_get+0x41/0x52 [11642.381353] [<ffffffff8147ac08>] io_schedule_timeout+0xa0/0x102 [11642.382255] [<ffffffff8147ac08>] ? io_schedule_timeout+0xa0/0x102 [11642.383162] [<ffffffff8147b814>] bit_wait_io+0x1b/0x39 [11642.383945] [<ffffffff8147bb21>] __wait_on_bit_lock+0x4c/0x90 [11642.384875] [<ffffffff8111829f>] __lock_page+0x66/0x68 [11642.385749] [<ffffffff81082f29>] ? autoremove_wake_function+0x3a/0x3a [11642.386721] [<ffffffffa0450ddd>] lock_page+0x31/0x34 [btrfs] [11642.387596] [<ffffffffa0454e3b>] extent_write_cache_pages.isra.19.constprop.35+0x1af/0x2f4 [btrfs] [11642.389030] [<ffffffffa0455373>] extent_writepages+0x4b/0x5c [btrfs] [11642.389973] [<ffffffff810a25ad>] ? rcu_read_lock_sched_held+0x61/0x69 [11642.390939] [<ffffffffa043c913>] ? btrfs_writepage_start_hook+0xce/0xce [btrfs] [11642.392271] [<ffffffffa0451c32>] ? __clear_extent_bit+0x26e/0x2c0 [btrfs] [11642.393305] [<ffffffffa043aa82>] btrfs_writepages+0x28/0x2a [btrfs] [11642.394239] [<ffffffff811236bc>] do_writepages+0x23/0x2c [11642.395045] [<ffffffff811198c9>] __filemap_fdatawrite_range+0x5a/0x61 [11642.395991] [<ffffffff81119946>] filemap_fdatawrite_range+0x13/0x15 [11642.397144] [<ffffffffa044f87e>] btrfs_start_ordered_extent+0xd0/0x1a1 [btrfs] [11642.398392] [<ffffffffa0452094>] ? clear_extent_bit+0x17/0x19 [btrfs] [11642.399363] [<ffffffffa0445945>] btrfs_get_blocks_direct+0x12b/0x61c [btrfs] [11642.400445] [<ffffffff8119f7a1>] ? dio_bio_add_page+0x3d/0x54 [11642.401309] [<ffffffff8119fa93>] ? submit_page_section+0x7b/0x111 [11642.402213] [<ffffffff811a0258>] do_blockdev_direct_IO+0x685/0xc24 [11642.403139] [<ffffffffa044581a>] ? btrfs_page_exists_in_range+0x1a1/0x1a1 [btrfs] [11642.404360] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs] [11642.406187] [<ffffffff811a0828>] __blockdev_direct_IO+0x31/0x33 [11642.407070] [<ffffffff811a0828>] ? __blockdev_direct_IO+0x31/0x33 [11642.407990] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs] [11642.409192] [<ffffffffa043b4ca>] btrfs_direct_IO+0x1c7/0x27e [btrfs] [11642.410146] [<ffffffffa043d267>] ? btrfs_get_extent_fiemap+0x1c0/0x1c0 [btrfs] [11642.411291] [<ffffffff81119a2c>] generic_file_read_iter+0x89/0x4e1 [11642.412263] [<ffffffff8108ac05>] ? mark_lock+0x24/0x201 [11642.413057] [<ffffffff8116e1f8>] __vfs_read+0x79/0x9d [11642.413897] [<ffffffff8116e6f1>] vfs_read+0x8f/0xd2 [11642.414708] [<ffffffff8116ef3d>] SyS_read+0x50/0x7e [11642.415573] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.416572] 1 lock held by fdm-stress/26850: [11642.417345] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff811877e8>] __fdget_pos+0x3a/0x40 [11642.418703] INFO: task fdm-stress:26851 blocked for more than 120 seconds. [11642.419698] Not tainted 4.4.0-rc6-btrfs-next-21+ #1 [11642.420612] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [11642.421807] fdm-stress D ffff880196483d28 0 26851 26591 0x00000000 [11642.422878] ffff880196483d28 00ff8801c8f60740 0000000000014ec0 ffff88023ed94ec0 [11642.424149] ffff8801c8f60740 ffff880196484000 0000000000000246 ffff8801c8f60740 [11642.425374] ffff8801bb711840 ffff8801bb711878 ffff880196483d40 ffffffff8147b541 [11642.426591] Call Trace: [11642.427013] [<ffffffff8147b541>] schedule+0x82/0x9a [11642.427856] [<ffffffff8147b6d5>] schedule_preempt_disabled+0x18/0x24 [11642.428852] [<ffffffff8147c23a>] mutex_lock_nested+0x1d7/0x3b4 [11642.429743] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.430911] [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.432102] [<ffffffffa044f674>] ? btrfs_wait_ordered_roots+0x57/0x191 [btrfs] [11642.433259] [<ffffffffa044f456>] ? btrfs_wait_ordered_extents+0x50/0x217 [btrfs] [11642.434431] [<ffffffffa044f6ea>] btrfs_wait_ordered_roots+0xcd/0x191 [btrfs] [11642.436079] [<ffffffffa0410cab>] btrfs_sync_fs+0xe0/0x1ad [btrfs] [11642.437009] [<ffffffff81197900>] ? SyS_tee+0x23c/0x23c [11642.437860] [<ffffffff81197920>] sync_fs_one_sb+0x20/0x22 [11642.438723] [<ffffffff81171435>] iterate_supers+0x75/0xc2 [11642.439597] [<ffffffff81197d00>] sys_sync+0x52/0x80 [11642.440454] [<ffffffff8147fa97>] entry_SYSCALL_64_fastpath+0x12/0x6b [11642.441533] 3 locks held by fdm-stress/26851: [11642.442370] #0: (&type->s_umount_key#37){+++++.}, at: [<ffffffff8117141f>] iterate_supers+0x5f/0xc2 [11642.444043] #1: (&fs_info->ordered_operations_mutex){+.+...}, at: [<ffffffffa044f661>] btrfs_wait_ordered_roots+0x44/0x191 [btrfs] [11642.446010] #2: (&root->ordered_extent_mutex){+.+...}, at: [<ffffffffa044f456>] btrfs_wait_ordered_extents+0x50/0x217 [btrfs] This happened because under specific timings the path for direct IO reads can deadlock with concurrent buffered writes. The diagram below shows how this happens for an example file that has the following layout: [ extent A ] [ extent B ] [ .... 0K 4K 8K CPU 1 CPU 2 CPU 3 DIO read against range [0K, 8K[ starts btrfs_direct_IO() --> calls btrfs_get_blocks_direct() which finds the extent map for the extent A and leaves the range [0K, 4K[ locked in the inode's io tree buffered write against range [4K, 8K[ starts __btrfs_buffered_write() --> dirties page at 4K a user space task calls sync for e.g or writepages() is invoked by mm writepages() run_delalloc_range() cow_file_range() --> ordered extent X for the buffered write is created and writeback starts --> calls btrfs_get_blocks_direct() again, without submitting first a bio for reading extent A, and finds the extent map for extent B --> calls lock_extent_direct() --> locks range [4K, 8K[ --> finds ordered extent X covering range [4K, 8K[ --> unlocks range [4K, 8K[ buffered write against range [0K, 8K[ starts __btrfs_buffered_write() prepare_pages() --> locks pages with offsets 0 and 4K lock_and_cleanup_extent_if_need() --> blocks attempting to lock range [0K, 8K[ in the inode's io tree, because the range [0, 4K[ is already locked by the direct IO task at CPU 1 --> calls btrfs_start_ordered_extent(oe X) btrfs_start_ordered_extent(oe X) --> At this point writeback for ordered extent X has not finished yet filemap_fdatawrite_range() btrfs_writepages() extent_writepages() extent_write_cache_pages() --> finds page with offset 0 with the writeback tag (and not dirty) --> tries to lock it --> deadlock, task at CPU 2 has the page locked and is blocked on the io range [0, 4K[ that was locked earlier by this task So fix this by falling back to a buffered read in the direct IO read path when an ordered extent for a buffered write is found. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01Btrfs: fix extent_same allowing destination offset beyond i_sizeFilipe Manana1-0/+3
When using the same file as the source and destination for a dedup (extent_same ioctl) operation we were allowing it to dedup to a destination offset beyond the file's size, which doesn't make sense and it's not allowed for the case where the source and destination files are not the same file. This made de deduplication operation successful only when the source range corresponded to a hole, a prealloc extent or an extent with all bytes having a value of 0x00. This was also leaving a file hole (between i_size and destination offset) without the corresponding file extent items, which can be reproduced with the following steps for example: $ mkfs.btrfs -f /dev/sdi $ mount /dev/sdi /mnt/sdi $ xfs_io -f -c "pwrite -S 0xab 304457 404990" /mnt/sdi/foobar wrote 404990/404990 bytes at offset 304457 395 KiB, 99 ops; 0.0000 sec (31.150 MiB/sec and 7984.5149 ops/sec) $ /git/hub/duperemove/btrfs-extent-same 24576 /mnt/sdi/foobar 28672 /mnt/sdi/foobar 929792 Deduping 2 total files (28672, 24576): /mnt/sdi/foobar (929792, 24576): /mnt/sdi/foobar 1 files asked to be deduped i: 0, status: 0, bytes_deduped: 24576 24576 total bytes deduped in this operation $ umount /mnt/sdi $ btrfsck /dev/sdi Checking filesystem on /dev/sdi UUID: 98c528aa-0833-427d-9403-b98032ffbf9d checking extents checking free space cache checking fs roots root 5 inode 257 errors 100, file extent discount Found file extent holes: start: 712704, len: 217088 found 540673 bytes used err is 1 total csum bytes: 400 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123675 file data blocks allocated: 671744 referenced 671744 btrfs-progs v4.2.3 So fix this by not allowing the destination to go beyond the file's size, just as we do for the same where the source and destination files are not the same. A test for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01Btrfs: fix file loss on log replay after renaming a file and fsyncFilipe Manana2-12/+59
We have two cases where we end up deleting a file at log replay time when we should not. For this to happen the file must have been renamed and a directory inode must have been fsynced/logged. Two examples that exercise these two cases are listed below. Case 1) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/c $ touch /mnt/a/b/foo $ sync $ mv /mnt/a/b/foo /mnt/c/ # Create file bar just to make sure the fsync on directory a/ does # something and it's not a no-op. $ touch /mnt/a/bar $ xfs_io -c "fsync" /mnt/a < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file foo. Case 2) $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/a $ mkdir /mnt/b $ mkdir /mnt/c $ touch /mnt/a/foo $ ln /mnt/a/foo /mnt/b/foo_link $ touch /mnt/b/bar $ sync $ unlink /mnt/b/foo_link $ mv /mnt/b/bar /mnt/c/ $ xfs_io -c "fsync" /mnt/a/foo < power fail / crash > The next time the filesystem is mounted, the log replay procedure deletes file bar. The reason why the files are deleted is because when we log inodes other then the fsync target inode, we ignore their last_unlink_trans value and leave the log without enough information to later replay the rename operations. So we need to look at the last_unlink_trans values and fallback to a transaction commit if they are greater than the id of the last committed transaction. So fix this by looking at the last_unlink_trans values and fallback to transaction commits when needed. Also, when logging other inodes (for case 1 we logged descendants of the fsync target inode while for case 2 we logged ascendants) we need to care about concurrent tasks updating the last_unlink_trans of inodes we are logging (which was already an existing problem in check_parent_dirs_for_sync()). Since we can not acquire their inode mutex (vfs' struct inode ->i_mutex), as that causes deadlocks with other concurrent operations that acquire the i_mutex of 2 inodes (other fsyncs or renames for example), we need to serialize on the log_mutex of the inode we are logging. A task setting a new value for an inode's last_unlink_trans must acquire the inode's log_mutex and it must do this update before doing the actual unlink operation (which is already the case except when deleting a snapshot). Conversely the task logging the inode must first log the inode and then check the inode's last_unlink_trans value while holding its log_mutex, as if its value is not greater then the id of the last committed transaction it means it logged a safe state of the inode's items, while if its value is not smaller then the id of the last committed transaction it means the inode state it has logged might not be safe (the concurrent task might have just updated last_unlink_trans but hasn't done yet the unlink operation) and therefore a transaction commit must be done. Test cases for xfstests follow in separate patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-03-01Btrfs: fix unreplayable log after snapshot delete + parent dir fsyncFilipe Manana3-0/+20
If we delete a snapshot, fsync its parent directory and crash/power fail before the next transaction commit, on the next mount when we attempt to replay the log tree of the root containing the parent directory we will fail and prevent the filesystem from mounting, which is solvable by wiping out the log trees with the btrfs-zero-log tool but very inconvenient as we will lose any data and metadata fsynced before the parent directory was fsynced. For example: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ mkdir /mnt/testdir $ btrfs subvolume snapshot /mnt /mnt/testdir/snap $ btrfs subvolume delete /mnt/testdir/snap $ xfs_io -c "fsync" /mnt/testdir < crash / power failure and reboot > $ mount /dev/sdc /mnt mount: mount(2) failed: No such file or directory And in dmesg/syslog we get the following message and trace: [192066.361162] BTRFS info (device dm-0): failed to delete reference to snap, inode 257 parent 257 [192066.363010] ------------[ cut here ]------------ [192066.365268] WARNING: CPU: 4 PID: 5130 at fs/btrfs/inode.c:3986 __btrfs_unlink_inode+0x17a/0x354 [btrfs]() [192066.367250] BTRFS: Transaction aborted (error -2) [192066.368401] Modules linked in: btrfs dm_flakey dm_mod ppdev sha256_generic xor raid6_pq hmac drbg ansi_cprng aesni_intel acpi_cpufreq tpm_tis aes_x86_64 tpm ablk_helper evdev cryptd sg parport_pc i2c_piix4 psmouse lrw parport i2c_core pcspkr gf128mul processor serio_raw glue_helper button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs] [192066.377154] CPU: 4 PID: 5130 Comm: mount Tainted: G W 4.4.0-rc6-btrfs-next-20+ #1 [192066.378875] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014 [192066.380889] 0000000000000000 ffff880143923670 ffffffff81257570 ffff8801439236b8 [192066.382561] ffff8801439236a8 ffffffff8104ec07 ffffffffa039dc2c 00000000fffffffe [192066.384191] ffff8801ed31d000 ffff8801b9fc9c88 ffff8801086875e0 ffff880143923710 [192066.385827] Call Trace: [192066.386373] [<ffffffff81257570>] dump_stack+0x4e/0x79 [192066.387387] [<ffffffff8104ec07>] warn_slowpath_common+0x99/0xb2 [192066.388429] [<ffffffffa039dc2c>] ? __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.389236] [<ffffffff8104ec68>] warn_slowpath_fmt+0x48/0x50 [192066.389884] [<ffffffffa039dc2c>] __btrfs_unlink_inode+0x17a/0x354 [btrfs] [192066.390621] [<ffffffff81184b55>] ? iput+0xb0/0x266 [192066.391200] [<ffffffffa039ea25>] btrfs_unlink_inode+0x1c/0x3d [btrfs] [192066.391930] [<ffffffffa03ca623>] check_item_in_log+0x1fe/0x29b [btrfs] [192066.392715] [<ffffffffa03ca827>] replay_dir_deletes+0x167/0x1cf [btrfs] [192066.393510] [<ffffffffa03cccc7>] replay_one_buffer+0x417/0x570 [btrfs] [192066.394241] [<ffffffffa03ca164>] walk_up_log_tree+0x10e/0x1dc [btrfs] [192066.394958] [<ffffffffa03cac72>] walk_log_tree+0xa5/0x190 [btrfs] [192066.395628] [<ffffffffa03ce8b8>] btrfs_recover_log_trees+0x239/0x32c [btrfs] [192066.396790] [<ffffffffa03cc8b0>] ? replay_one_extent+0x50a/0x50a [btrfs] [192066.397891] [<ffffffffa0394041>] open_ctree+0x1d8b/0x2167 [btrfs] [192066.398897] [<ffffffffa03706e1>] btrfs_mount+0x5ef/0x729 [btrfs] [192066.399823] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.400739] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.401700] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.402482] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.403930] [<ffffffffa03702bd>] btrfs_mount+0x1cb/0x729 [btrfs] [192066.404831] [<ffffffff8108ad98>] ? trace_hardirqs_on+0xd/0xf [192066.405726] [<ffffffff8108959b>] ? lockdep_init_map+0xb9/0x1b3 [192066.406621] [<ffffffff811714b9>] mount_fs+0x67/0x131 [192066.407401] [<ffffffff81188560>] vfs_kern_mount+0x6c/0xde [192066.408247] [<ffffffff8118ae36>] do_mount+0x893/0x9d2 [192066.409047] [<ffffffff8113009b>] ? strndup_user+0x3f/0x8c [192066.409842] [<ffffffff8118b187>] SyS_mount+0x75/0xa1 [192066.410621] [<ffffffff8147e517>] entry_SYSCALL_64_fastpath+0x12/0x6b [192066.411572] ---[ end trace 2de42126c1e0a0f0 ]--- [192066.412344] BTRFS: error (device dm-0) in __btrfs_unlink_inode:3986: errno=-2 No such entry [192066.413748] BTRFS: error (device dm-0) in btrfs_replay_log:2464: errno=-2 No such entry (Failed to recover log tree) [192066.415458] BTRFS error (device dm-0): cleaner transaction attach returned -30 [192066.444613] BTRFS: open_ctree failed This happens because when we are replaying the log and processing the directory entry pointing to the snapshot in the subvolume tree, we treat its btrfs_dir_item item as having a location with a key type matching BTRFS_INODE_ITEM_KEY, which is wrong because the type matches BTRFS_ROOT_ITEM_KEY and therefore must be processed differently, as the object id refers to a root number and not to an inode in the root containing the parent directory. So fix this by triggering a transaction commit if an fsync against the parent directory is requested after deleting a snapshot. This is the simplest approach for a rare use case. Some alternative that avoids the transaction commit would require more code to explicitly delete the snapshot at log replay time (factoring out common code from ioctl.c: btrfs_ioctl_snap_destroy()), special care at fsync time to remove the log tree of the snapshot's root from the log root of the root of tree roots, amongst other steps. A test case for xfstests that triggers the issue follows. seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { _cleanup_flakey cd / rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_target flakey _require_metadata_journaling $SCRATCH_DEV rm -f $seqres.full _scratch_mkfs >>$seqres.full 2>&1 _init_flakey _mount_flakey # Create a snapshot at the root of our filesystem (mount point path), delete it, # fsync the mount point path, crash and mount to replay the log. This should # succeed and after the filesystem is mounted the snapshot should not be visible # anymore. _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap1 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/snap1 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT _flakey_drop_and_remount [ -e $SCRATCH_MNT/snap1 ] && \ echo "Snapshot snap1 still exists after log replay" # Similar scenario as above, but this time the snapshot is created inside a # directory and not directly under the root (mount point path). mkdir $SCRATCH_MNT/testdir _run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/testdir/snap2 _run_btrfs_util_prog subvolume delete $SCRATCH_MNT/testdir/snap2 $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir _flakey_drop_and_remount [ -e $SCRATCH_MNT/testdir/snap2 ] && \ echo "Snapshot snap2 still exists after log replay" _unmount_flakey echo "Silence is golden" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2016-02-28Linux 4.5-rc6Linus Torvalds1-1/+1
2016-02-27do_last(): ELOOP failure exit should be done after leaving RCU modeAl Viro1-5/+4
... or we risk seeing a bogus value of d_is_symlink() there. Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27should_follow_link(): validate ->d_seq after having decided to followAl Viro1-0/+5
... otherwise d_is_symlink() above might have nothing to do with the inode value we've got. Cc: stable@vger.kernel.org # v4.2+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27namei: ->d_inode of a pinned dentry is stable only for positivesAl Viro1-2/+2
both do_last() and walk_component() risk picking a NULL inode out of dentry about to become positive, *then* checking its flags and seeing that it's not negative anymore and using (already stale by then) value they'd fetched earlier. Usually ends up oopsing soon after that... Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27do_last(): don't let a bogus return value from ->open() et.al. to confuse usAl Viro1-0/+4
... into returning a positive to path_openat(), which would interpret that as "symlink had been encountered" and proceed to corrupt memory, etc. It can only happen due to a bug in some ->open() instance or in some LSM hook, etc., so we report any such event *and* make sure it doesn't trick us into further unpleasantness. Cc: stable@vger.kernel.org # v3.6+, at least Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27fs: return -EOPNOTSUPP if clone is not supportedChristoph Hellwig1-2/+4
-EBADF is a rather confusing error if an operations is not supported, and nfsd gets rather upset about it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27hpfs: don't truncate the file when delete failsMikulas Patocka1-28/+3
The delete opration can allocate additional space on the HPFS filesystem due to btree split. The HPFS driver checks in advance if there is available space, so that it won't corrupt the btree if we run out of space during splitting. If there is not enough available space, the HPFS driver attempted to truncate the file, but this results in a deadlock since the commit 7dd29d8d865efdb00c0542a5d2c87af8c52ea6c7 ("HPFS: Introduce a global mutex and lock it on every callback from VFS"). This patch removes the code that tries to truncate the file and -ENOSPC is returned instead. If the user hits -ENOSPC on delete, he should try to delete other files (that are stored in a leaf btree node), so that the delete operation will make some space for deleting the file stored in non-leaf btree node. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: stable@vger.kernel.org # 2.6.39+ Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-02-27ext2, ext4: fix issue with missing journal entry in ext4_dax_mkwrite()Ross Zwisler2-35/+3
As it is currently written ext4_dax_mkwrite() assumes that the call into __dax_mkwrite() will not have to do a block allocation so it doesn't create a journal entry. For a read that creates a zero page to cover a hole followed by a write that actually allocates storage this is incorrect. The ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls get_blocks() to allocate storage. Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault() as this function already has all the logic needed to allocate a journal entry and call __dax_fault(). Also update the ext2 fault handlers in this same way to remove duplicate code and keep the logic between ext2 and ext4 the same. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2016-02-27dax: move writeback calls into the filesystemsRoss Zwisler7-16/+43
Previously calls to dax_writeback_mapping_range() for all DAX filesystems (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range(). dax_writeback_mapping_range() needs a struct block_device, and it used to get that from inode->i_sb->s_bdev. This is correct for normal inodes mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time files. Instead, call dax_writeback_mapping_range() directly from the filesystem ->writepages function so that it can supply us with a valid block device. This also fixes DAX code to properly flush caches in response to sync(2). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27dax: give DAX clearing code correct bdevRoss Zwisler6-10/+13
dax_clear_blocks() needs a valid struct block_device and previously it was using inode->i_sb->s_bdev in all cases. This is correct for normal inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for DAX raw block devices and for XFS real-time devices. Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change its arguments to take a bdev and a sector instead of an inode and a block. This better reflects what the function does, and it allows the filesystem and raw block device code to pass in an appropriate struct block_device. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27ext4: online defrag not supported with DAXRoss Zwisler1-0/+5
Online defrag operations for ext4 are hard coded to use the page cache. See ext4_ioctl() -> ext4_move_extents() -> move_extent_per_page() When combined with DAX I/O, which circumvents the page cache, this can result in data corruption. This was observed with xfstests ext4/307 and ext4/308. Fix this by only allowing online defrag for non-DAX files. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27ext2, ext4: only set S_DAX for regular inodesRoss Zwisler2-2/+2
When S_DAX is set on an inode we assume that if there are pages attached to the mapping (mapping->nrpages != 0), those pages are clean zero pages that were used to service reads from holes. Any dirty data associated with the inode should be in the form of DAX exceptional entries (mapping->nrexceptional) that is written back via dax_writeback_mapping_range(). With the current code, though, this isn't always true. For example, ext2 and ext4 directory inodes can have S_DAX set, but have their dirty data stored as dirty page cache entries. For these types of inodes, having S_DAX set doesn't really make sense since their I/O doesn't actually happen through the DAX code path. Instead, only allow S_DAX to be set for regular inodes for ext2 and ext4. This allows us to have strict DAX vs non-DAX paths in the writeback code. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27block: disable block device DAX by defaultDan Williams2-1/+18
The recent *sync enabling discovered that we are inserting into the block_device pagecache counter to the expectations of the dirty data tracking for dax mappings. This can lead to data corruption. We want to support DAX for block devices eventually, but it requires wider changes to properly manage the pagecache. dump_stack+0x85/0xc2 dax_writeback_mapping_range+0x60/0xe0 blkdev_writepages+0x3f/0x50 do_writepages+0x21/0x30 __filemap_fdatawrite_range+0xc6/0x100 filemap_write_and_wait+0x4a/0xa0 set_blocksize+0x70/0xd0 sb_set_blocksize+0x1d/0x50 ext4_fill_super+0x75b/0x3360 mount_bdev+0x180/0x1b0 ext4_mount+0x15/0x20 mount_fs+0x38/0x170 Mark the support broken so its disabled by default, but otherwise still available for testing. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@fb.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27ocfs2: unlock inode if deleting inode from orphan failsGuozhonghua1-0/+1
When doing append direct io cleanup, if deleting inode fails, it goes out without unlocking inode, which will cause the inode deadlock. This issue was introduced by commit cf1776a9e834 ("ocfs2: fix a tiny race when truncate dio orohaned entry"). Signed-off-by: Guozhonghua <guozhonghua@h3c.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> [4.2+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27mm: ASLR: use get_random_long()Daniel Cashman8-14/+14
Replace calls to get_random_int() followed by a cast to (unsigned long) with calls to get_random_long(). Also address shifting bug which, in case of x86 removed entropy mask for mmap_rnd_bits values > 31 bits. Signed-off-by: Daniel Cashman <dcashman@android.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Kralevich <nnk@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27drivers: char: random: add get_random_long()Daniel Cashman2-0/+23
Commit d07e22597d1d ("mm: mmap: add new /proc tunable for mmap_base ASLR") added the ability to choose from a range of values to use for entropy count in generating the random offset to the mmap_base address. The maximum value on this range was set to 32 bits for 64-bit x86 systems, but this value could be increased further, requiring more than the 32 bits of randomness provided by get_random_int(), as is already possible for arm64. Add a new function: get_random_long() which more naturally fits with the mmap usage of get_random_int() but operates exactly the same as get_random_int(). Also, fix the shifting constant in mmap_rnd() to be an unsigned long so that values greater than 31 bits generate an appropriate mask without overflow. This is especially important on x86, as its shift instruction uses a 5-bit mask for the shift operand, which meant that any value for mmap_rnd_bits over 31 acts as a no-op and effectively disables mmap_base randomization. Finally, replace calls to get_random_int() with get_random_long() where appropriate. This patch (of 2): Add get_random_long(). Signed-off-by: Daniel Cashman <dcashman@android.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: David S. Miller <davem@davemloft.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Kralevich <nnk@google.com> Cc: Jeff Vander Stoep <jeffv@google.com> Cc: Mark Salyzyn <salyzyn@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27mm: numa: quickly fail allocations for NUMA balancing on full nodesMel Gorman1-1/+1
Commit 4167e9b2cf10 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE flag combination due to confusing semantics. It noted that alloc_misplaced_dst_page() was one such user after changes made by commit e97ca8e5b864 ("mm: fix GFP_THISNODE callers and clarify"). Unfortunately when GFP_THISNODE was removed, users of alloc_misplaced_dst_page() started waking kswapd and entering direct reclaim because the wrong GFP flags are cleared. The consequence is that workloads that used to fit into memory now get reclaimed which is addressed by this patch. The problem can be demonstrated with "mutilate" that exercises memcached which is software dedicated to memory object caching. The configuration uses 80% of memory and is run 3 times for varying numbers of clients. The results on a 4-socket NUMA box are mutilate 4.4.0 4.4.0 vanilla numaswap-v1 Hmean 1 8394.71 ( 0.00%) 8395.32 ( 0.01%) Hmean 4 30024.62 ( 0.00%) 34513.54 ( 14.95%) Hmean 7 32821.08 ( 0.00%) 70542.96 (114.93%) Hmean 12 55229.67 ( 0.00%) 93866.34 ( 69.96%) Hmean 21 39438.96 ( 0.00%) 85749.21 (117.42%) Hmean 30 37796.10 ( 0.00%) 50231.49 ( 32.90%) Hmean 47 18070.91 ( 0.00%) 38530.13 (113.22%) The metric is queries/second with the more the better. The results are way outside of the noise and the reason for the improvement is obvious from some of the vmstats 4.4.0 4.4.0 vanillanumaswap-v1r1 Minor Faults 1929399272 2146148218 Major Faults 19746529 3567 Swap Ins 57307366 9913 Swap Outs 50623229 17094 Allocation stalls 35909 443 DMA allocs 0 0 DMA32 allocs 72976349 170567396 Normal allocs 5306640898 5310651252 Movable allocs 0 0 Direct pages scanned 404130893 799577 Kswapd pages scanned 160230174 0 Kswapd pages reclaimed 55928786 0 Direct pages reclaimed 1843936 41921 Page writes file 2391 0 Page writes anon 50623229 17094 The vanilla kernel is swapping like crazy with large amounts of direct reclaim and kswapd activity. The figures are aggregate but it's known that the bad activity is throughout the entire test. Note that simple streaming anon/file memory consumers also see this problem but it's not as obvious. In those cases, kswapd is awake when it should not be. As there are at least two reclaim-related bugs out there, it's worth spelling out the user-visible impact. This patch only addresses bugs related to excessive reclaim on NUMA hardware when the working set is larger than a NUMA node. There is a bug related to high kswapd CPU usage but the reports are against laptops and other UMA hardware and is not addressed by this patch. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [4.1+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEEDAndrea Arcangeli1-2/+12
pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were introduced to locklessy (but atomically) detect when a pmd is a regular (stable) pmd or when the pmd is unstable and can infinitely transition from pmd_none() and pmd_trans_huge() from under us, while only holding the mmap_sem for reading (for writing not). While holding the mmap_sem only for reading, MADV_DONTNEED can run from under us and so before we can assume the pmd to be a regular stable pmd we need to compare it against pmd_none() and pmd_trans_huge() in an atomic way, with pmd_trans_unstable(). The old pmd_trans_huge() left a tiny window for a race. Useful applications are unlikely to notice the difference as doing MADV_DONTNEED concurrently with a page fault would lead to undefined behavior. [akpm@linux-foundation.org: tidy up comment grammar/layout] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-27PCI: mvebu: Restrict build to 32-bit ARMThierry Reding1-0/+1
This driver uses PCI glue that is only available on 32-bit ARM. This used to work fine as long as ARCH_MVEBU and ARCH_DOVE were exclusively 32-bit, but there's a patch in the pipe to make ARCH_MVEBU also available on 64-bit ARM. [bhelgaas: changelog; patch is coming but not merged yet] Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
2016-02-27Revert "PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()"Bjorn Helgaas5-16/+37
991de2e59090 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()") appeared in v4.3 and helps support IOAPIC hotplug. Олег reported that the Elcus-1553 TA1-PCI driver worked in v4.2 but not v4.3 and bisected it to 991de2e59090. Sunjin reported that the RocketRAID 272x driver worked in v4.2 but not v4.3. In both cases booting with "pci=routirq" is a workaround. I think the problem is that after 991de2e59090, we no longer call pcibios_enable_irq() for upstream bridges. Prior to 991de2e59090, when a driver called pci_enable_device(), we recursively called pcibios_enable_irq() for upstream bridges via pci_enable_bridge(). After 991de2e59090, we call pcibios_enable_irq() from pci_device_probe() instead of the pci_enable_device() path, which does *not* call pcibios_enable_irq() for upstream bridges. Revert 991de2e59090 to fix these driver regressions. Link: https://bugzilla.kernel.org/show_bug.cgi?id=111211 Fixes: 991de2e59090 ("PCI, x86: Implement pcibios_alloc_irq() and pcibios_free_irq()") Reported-and-tested-by: Олег Мороз <oleg.moroz@mcc.vniiem.ru> Reported-by: Sunjin Yang <fan4326@gmail.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> CC: Jiang Liu <jiang.liu@linux.intel.com>
2016-02-26x86/mpx: Fix off-by-one comparison with nr_registersColin Ian King1-1/+1
In the unlikely event that regno == nr_registers then we get an array overrun on regoff because the invalid register check is currently off-by-one. Fix this with a check that regno is >= nr_registers instead. Detected with static analysis using CoverityScan. Fixes: fcc7ffd67991 "x86, mpx: Decode MPX instruction to get bound violation information" Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1456512931-3388-1-git-send-email-colin.king@canonical.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-26ALSA: hda - Loop interrupt handling until really clearedTakashi Iwai3-23/+33
Currently the interrupt handler of HD-audio driver assumes that no irq update is needed while processing the irq. But in reality, it has been confirmed that the HW irq is issued even during the irq handling. Since we clear the irq status at the beginning, process the interrupt, then exits from the handler, the lately issued interrupt is left untouched without being properly processed. This patch changes the interrupt handler code to loop over the check-and-process. The handler tries repeatedly as long as the IRQ status are turned on, and either stream or CORB/RIRB is handled. For checking the stream handling, snd_hdac_bus_handle_stream_irq() returns a value indicating the stream indices bits. Other than that, the change is only in the irq handler itself. Reported-by: Libin Yang <libin.yang@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-02-25x86/mm: Fix slow_virt_to_phys() for X86_PAE againDexuan Cui1-4/+10
"d1cd12108346: x86, pageattr: Prevent overflow in slow_virt_to_phys() for X86_PAE" was unintentionally removed by the recent "34437e67a672: x86/mm: Fix slow_virt_to_phys() to handle large PAT bit". And, the variable 'phys_addr' was defined as "unsigned long" by mistake -- it should be "phys_addr_t". As a result, Hyper-V network driver in 32-PAE Linux guest can't work again. Fixes: commit 34437e67a672: "x86/mm: Fix slow_virt_to_phys() to handle large PAT bit" Signed-off-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Cc: olaf@aepfle.de Cc: gregkh@linuxfoundation.org Cc: jasowang@redhat.com Cc: driverdev-devel@linuxdriverproject.org Cc: linux-mm@kvack.org Cc: apw@canonical.com Cc: Andrew Morton <akpm@linux-foundation.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Link: http://lkml.kernel.org/r/1456394292-9030-1-git-send-email-decui@microsoft.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25ALSA: hda - Fix headset support and noise on HP EliteBook 755 G2Takashi Iwai1-0/+8
HP EliteBook 755 G2 with ALC3228 (ALC280) codec [103c:221c] requires the known fixup (ALC269_FIXUP_HEADSET_MIC) for making the headset mic working. Also, it suffers from the loopback noise problem, so we should disable aamix path as well. Reported-by: Derick Eddington <derick.eddington@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-02-25ALSA: hda - Fixup speaker pass-through control for nid 0x14 on ALC225David Henningsson1-2/+21
On one of the machines we enable, we found that the actual speaker volume did not always correspond to the volume set in alsamixer. This patch fixes that problem. This patch was orginally written by Kailang @ Realtek, I've rebased it to fit sound git master. Cc: stable@vger.kernel.org BugLink: https://bugs.launchpad.net/bugs/1549660 Co-Authored-By: Kailang <kailang@realtek.com> Signed-off-by: David Henningsson <david.henningsson@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-02-25KVM: x86: MMU: fix ubsan index-out-of-range warningMike Krinkin1-1/+1
Ubsan reports the following warning due to a typo in update_accessed_dirty_bits template, the patch fixes the typo: [ 168.791851] ================================================================================ [ 168.791862] UBSAN: Undefined behaviour in arch/x86/kvm/paging_tmpl.h:252:15 [ 168.791866] index 4 is out of range for type 'u64 [4]' [ 168.791871] CPU: 0 PID: 2950 Comm: qemu-system-x86 Tainted: G O L 4.5.0-rc5-next-20160222 #7 [ 168.791873] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013 [ 168.791876] 0000000000000000 ffff8801cfcaf208 ffffffff81c9f780 0000000041b58ab3 [ 168.791882] ffffffff82eb2cc1 ffffffff81c9f6b4 ffff8801cfcaf230 ffff8801cfcaf1e0 [ 168.791886] 0000000000000004 0000000000000001 0000000000000000 ffffffffa1981600 [ 168.791891] Call Trace: [ 168.791899] [<ffffffff81c9f780>] dump_stack+0xcc/0x12c [ 168.791904] [<ffffffff81c9f6b4>] ? _atomic_dec_and_lock+0xc4/0xc4 [ 168.791910] [<ffffffff81da9e81>] ubsan_epilogue+0xd/0x8a [ 168.791914] [<ffffffff81daafa2>] __ubsan_handle_out_of_bounds+0x15c/0x1a3 [ 168.791918] [<ffffffff81daae46>] ? __ubsan_handle_shift_out_of_bounds+0x2bd/0x2bd [ 168.791922] [<ffffffff811287ef>] ? get_user_pages_fast+0x2bf/0x360 [ 168.791954] [<ffffffffa1794050>] ? kvm_largepages_enabled+0x30/0x30 [kvm] [ 168.791958] [<ffffffff81128530>] ? __get_user_pages_fast+0x360/0x360 [ 168.791987] [<ffffffffa181b818>] paging64_walk_addr_generic+0x1b28/0x2600 [kvm] [ 168.792014] [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm] [ 168.792019] [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350 [ 168.792044] [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm] [ 168.792076] [<ffffffffa181c36d>] paging64_gva_to_gpa+0x7d/0x110 [kvm] [ 168.792121] [<ffffffffa181c2f0>] ? paging64_walk_addr_generic+0x2600/0x2600 [kvm] [ 168.792130] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792178] [<ffffffffa17d9a4a>] emulator_read_write_onepage+0x27a/0x1150 [kvm] [ 168.792208] [<ffffffffa1794d44>] ? __kvm_read_guest_page+0x54/0x70 [kvm] [ 168.792234] [<ffffffffa17d97d0>] ? kvm_task_switch+0x160/0x160 [kvm] [ 168.792238] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792263] [<ffffffffa17daa07>] emulator_read_write+0xe7/0x6d0 [kvm] [ 168.792290] [<ffffffffa183b620>] ? em_cr_write+0x230/0x230 [kvm] [ 168.792314] [<ffffffffa17db005>] emulator_write_emulated+0x15/0x20 [kvm] [ 168.792340] [<ffffffffa18465f8>] segmented_write+0xf8/0x130 [kvm] [ 168.792367] [<ffffffffa1846500>] ? em_lgdt+0x20/0x20 [kvm] [ 168.792374] [<ffffffffa14db512>] ? vmx_read_guest_seg_ar+0x42/0x1e0 [kvm_intel] [ 168.792400] [<ffffffffa1846d82>] writeback+0x3f2/0x700 [kvm] [ 168.792424] [<ffffffffa1846990>] ? em_sidt+0xa0/0xa0 [kvm] [ 168.792449] [<ffffffffa185554d>] ? x86_decode_insn+0x1b3d/0x4f70 [kvm] [ 168.792474] [<ffffffffa1859032>] x86_emulate_insn+0x572/0x3010 [kvm] [ 168.792499] [<ffffffffa17e71dd>] x86_emulate_instruction+0x3bd/0x2110 [kvm] [ 168.792524] [<ffffffffa17e6e20>] ? reexecute_instruction.part.110+0x2e0/0x2e0 [kvm] [ 168.792532] [<ffffffffa14e9a81>] handle_ept_misconfig+0x61/0x460 [kvm_intel] [ 168.792539] [<ffffffffa14e9a20>] ? handle_pause+0x450/0x450 [kvm_intel] [ 168.792546] [<ffffffffa15130ea>] vmx_handle_exit+0xd6a/0x1ad0 [kvm_intel] [ 168.792572] [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm] [ 168.792597] [<ffffffffa17f6bcd>] kvm_arch_vcpu_ioctl_run+0xd3d/0x6090 [kvm] [ 168.792621] [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm] [ 168.792627] [<ffffffff8293b530>] ? __ww_mutex_lock_interruptible+0x1630/0x1630 [ 168.792651] [<ffffffffa17f5e90>] ? kvm_arch_vcpu_runnable+0x4f0/0x4f0 [kvm] [ 168.792656] [<ffffffff811eeb30>] ? preempt_notifier_unregister+0x190/0x190 [ 168.792681] [<ffffffffa17e0447>] ? kvm_arch_vcpu_load+0x127/0x650 [kvm] [ 168.792704] [<ffffffffa178e9a3>] kvm_vcpu_ioctl+0x553/0xda0 [kvm] [ 168.792727] [<ffffffffa178e450>] ? vcpu_put+0x40/0x40 [kvm] [ 168.792732] [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350 [ 168.792735] [<ffffffff82946087>] ? _raw_spin_unlock+0x27/0x40 [ 168.792740] [<ffffffff8163a943>] ? handle_mm_fault+0x1673/0x2e40 [ 168.792744] [<ffffffff8129daa8>] ? trace_hardirqs_on_caller+0x478/0x6c0 [ 168.792747] [<ffffffff8129dcfd>] ? trace_hardirqs_on+0xd/0x10 [ 168.792751] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792756] [<ffffffff81725a80>] do_vfs_ioctl+0x1b0/0x12b0 [ 168.792759] [<ffffffff817258d0>] ? ioctl_preallocate+0x210/0x210 [ 168.792763] [<ffffffff8174aef3>] ? __fget+0x273/0x4a0 [ 168.792766] [<ffffffff8174acd0>] ? __fget+0x50/0x4a0 [ 168.792770] [<ffffffff8174b1f6>] ? __fget_light+0x96/0x2b0 [ 168.792773] [<ffffffff81726bf9>] SyS_ioctl+0x79/0x90 [ 168.792777] [<ffffffff82946880>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 168.792780] ================================================================================ Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-25ALSA: hda - Fixing background noise on Dell Inspiron 3162Kai-Heng Feng1-0/+8
After login to the desktop on Dell Inspiron 3162, there's a very loud background noise comes from the builtin speaker. The noise does not go away even if the speaker is muted. The noise disappears after using the aamix fixup. Codec: Realtek ALC3234 Address: 0 AFG Function Id: 0x1 (unsol 1) Vendor Id: 0x10ec0255 Subsystem Id: 0x10280725 Revision Id: 0x100002 No Modem Function Group found BugLink: http://bugs.launchpad.net/bugs/1549620 Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
2016-02-25perf: Robustify task_function_call()Peter Zijlstra1-20/+20
Since there is no serialization between task_function_call() doing task_curr() and the other CPU doing context switches, we could end up not sending an IPI even if we had to. And I'm not sure I still buy my own argument we're OK. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174948.340031200@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Fix scaling vs. perf_install_in_context()Peter Zijlstra1-45/+70
Completely reworks perf_install_in_context() (again!) in order to ensure that there will be no ctx time hole between add_event_to_ctx() and any potential ctx_sched_in(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174948.279399438@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Fix scaling vs. perf_event_enable()Peter Zijlstra1-19/+23
Similar to the perf_enable_on_exec(), ensure that event timings are consistent across perf_event_enable(). Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174948.218288698@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Fix scaling vs. perf_event_enable_on_exec()Peter Zijlstra1-0/+1
The recent commit 3e349507d12d ("perf: Fix perf_enable_on_exec() event scheduling") caused this by moving task_ctx_sched_out() from before __perf_event_mask_enable() to after it. The overlooked consequence of that change is that task_ctx_sched_out() would update the ctx time fields, and now __perf_event_mask_enable() uses stale time. In order to fix this, explicitly stop our context's time before enabling the event(s). Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Fixes: 3e349507d12d ("perf: Fix perf_enable_on_exec() event scheduling") Link: http://lkml.kernel.org/r/20160224174948.159242158@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Fix ctx time tracking by introducing EVENT_TIMEPeter Zijlstra1-12/+30
Currently any ctx_sched_in() call will re-start the ctx time tracking, this means that calls like: ctx_sched_in(.event_type = EVENT_PINNED); ctx_sched_in(.event_type = EVENT_FLEXIBLE); will have a hole in their ctx time tracking. This is likely harmless but can confuse things a little. By adding EVENT_TIME, we can have the first ctx_sched_in() (is_active: 0 -> !0) start the time and any further ctx_sched_in() will leave the timestamps alone. Secondly, this allows for an early disable like: ctx_sched_out(.event_type = EVENT_TIME); which would update the ctx time (if the ctx is active) and any further calls to ctx_sched_out() would not further modify the ctx time. For ctx_sched_in() any 0 -> !0 transition will automatically include EVENT_TIME. For ctx_sched_out(), any transition that clears EVENT_ALL will automatically clear EVENT_TIME. These two rules ensure that under normal circumstances we need not bother with EVENT_TIME and get natural ctx time behaviour. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174948.100446561@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Cure event->pending_disable racePeter Zijlstra1-3/+3
Because event_sched_out() checks event->pending_disable _before_ actually disabling the event, it can happen that the event fires after it checks but before it gets disabled. This would leave event->pending_disable set and the queued irq_work will try and process it. However, if the event trigger was during schedule(), the event might have been de-scheduled by the time the irq_work runs, and perf_event_disable_local() will fail. Fix this by checking event->pending_disable _after_ we call event->pmu->del(). This depends on the latter being a compiler barrier, such that the compiler does not lift the load and re-creates the problem. Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Fix race between event install and jump_labelsPeter Zijlstra2-11/+44
perf_install_in_context() relies upon the context switch hooks to have scheduled in events when the IPI misses its target -- after all, if the task has moved from the CPU (or wasn't running at all), it will have to context switch to run elsewhere. This however doesn't appear to be happening. It is possible for the IPI to not happen (task wasn't running) only to later observe the task running with an inactive context. The only possible explanation is that the context switch hooks are not called. Therefore put in a sync_sched() after toggling the jump_label to guarantee all CPUs will have them enabled before we install an event. A simple if (0->1) sync_sched() will not in fact work, because any further increment can race and complete before the sync_sched(). Therefore we must jump through some hoops. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174947.980211985@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Fix cloningPeter Zijlstra2-15/+15
Alexander reported that when the 'original' context gets destroyed, no new clones happen. This can happen irrespective of the ctx switch optimization, any task can die, even the parent, and we want to continue monitoring the task hierarchy until we either close the event or no tasks are left in the hierarchy. perf_event_init_context() will attempt to pin the 'parent' context during clone(). At that point current is the parent, and since current cannot have exited while executing clone(), its context cannot have passed through perf_event_exit_task_context(). Therefore perf_pin_task_context() cannot observe ctx->task == TASK_TOMBSTONE. However, since inherit_event() does: if (parent_event->parent) parent_event = parent_event->parent; it looks at the 'original' event when it does: is_orphaned_event(). This can return true if the context that contains the this event has passed through perf_event_exit_task_context(). And thus we'll fail to clone the perf context. Fix this by adding a new state: STATE_DEAD, which is set by perf_release() to indicate that the filedesc (or kernel reference) is dead and there are no observers for our data left. Only for STATE_DEAD will is_orphaned_event() be true and inhibit cloning. STATE_EXIT is otherwise preserved such that is_event_hup() remains functional and will report when the observed task hierarchy becomes empty. Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events") Link: http://lkml.kernel.org/r/20160224174947.919845295@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Only update context time when activePeter Zijlstra1-6/+6
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174947.860690919@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Allow perf_release() with !event->ctxPeter Zijlstra1-3/+13
In the err_file: fput(event_file) case, the event will not yet have been attached to a context. However perf_release() does assume it has been. Cure this. Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174947.793996260@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Do not double freePeter Zijlstra1-1/+6
In case of: err_file: fput(event_file), we'll end up calling perf_release() which in turn will free the event. Do not then free the event _again_. Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25perf: Close install vs. exit racePeter Zijlstra1-9/+26
Consider the following scenario: CPU0 CPU1 ctx = find_get_ctx(); perf_event_exit_task_context() mutex_lock(&ctx->mutex); perf_install_in_context(ctx, ...); /* NO-OP */ mutex_unlock(&ctx->mutex); ... perf_release() WARN_ON_ONCE(event->state != STATE_EXIT); Since the event doesn't pass through perf_remove_from_context() because perf_install_in_context() NO-OPs because the ctx is dead, and perf_event_exit_task_context() will not observe the event because its not attached yet, the event->state will not be set. Solve this by revalidating ctx->task after we acquire ctx->mutex and failing the event creation as a whole. Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174947.626853419@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25x86/entry/compat: Add missing CLAC to entry_INT80_32Andy Lutomirski1-0/+1
This doesn't seem to fix a regression -- I don't think the CLAC was ever there. I double-checked in a debugger: entries through the int80 gate do not automatically clear AC. Stable maintainers: I can provide a backport to 4.3 and earlier if needed. This needs to be backported all the way to 3.10. Reported-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> # v3.10 and later Fixes: 63bcff2a307b ("x86, smap: Add STAC and CLAC instructions to control user space access") Link: http://lkml.kernel.org/r/b02b7e71ae54074be01fc171cbd4b72517055c0e.1456345086.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-25drm/nouveau/disp/dp: ensure sink is powered up before attempting link trainingBen Skeggs2-0/+16
This can happen under some annoying circumstances, and is a quick fix until more substantial changes can be made. Fixed eDP mode changes on (at least) the Lenovo P50. Signed-off-by: Ben Skeggs <bskeggs@redhat.com> Cc: stable@vger.kernel.org
2016-02-25drm/nouveau: platform: Fix deferred probeThierry Reding2-12/+30
The error cleanup paths aren't quite correct and will crash upon deferred probe. Cc: stable@vger.kernel.org # v4.3+ Reviewed-by: Ben Skeggs <bskeggs@redhat.com> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Thierry Reding <treding@nvidia.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-02-25drivers: sh: Restore legacy clock domain on SuperH platformsGeert Uytterhoeven1-1/+1
CONFIG_ARCH_SHMOBILE is not only enabled for Renesas ARM platforms (which are DT based and multi-platform), but also on a select set of Renesas SuperH platforms (SH7722/SH7723/SH7724/SH7343/SH7366). Hence since commit 0ba58de231066e47 ("drivers: sh: Get rid of CONFIG_ARCH_SHMOBILE_MULTI"), the legacy clock domain is no longer installed on these SuperH platforms, and module clocks may not be enabled when needed, leading to driver failures. To fix this, add an additional check for CONFIG_OF. Fixes: 0ba58de231066e47 ("drivers: sh: Get rid of CONFIG_ARCH_SHMOBILE_MULTI"). Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>