aboutsummaryrefslogtreecommitdiffstats
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-02-07mm: make mm->pinned_vm an atomic64 counterDavidlohr Bueso1-1/+1
Taking a sleeping lock to _only_ increment a variable is quite the overkill, and pretty much all users do this. Furthermore, some drivers (ie: infiniband and scif) that need pinned semantics can go to quite some trouble to actually delay via workqueue (un)accounting for pinned pages when not possible to acquire it. By making the counter atomic we no longer need to hold the mmap_sem and can simply some code around it for pinned_vm users. The counter is 64-bit such that we need not worry about overflows such as rdma user input controlled from userspace. Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-03Merge tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds4-38/+71
Pull btrfs fixes from David Sterba: - regression fix: transaction commit can run away due to delayed ref waiting heuristic, this is not necessary now because of the proper reservation mechanism introduced in 5.0 - regression fix: potential crash due to use-before-check of an ERR_PTR return value - fix for transaction abort during transaction commit that needs to properly clean up pending block groups - fix deadlock during b-tree node/leaf splitting, when this happens on some of the fundamental trees, we must prevent new tree block allocation to re-enter indirectly via the block group flushing path - potential memory leak after errors during mount * tag 'for-5.0-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: On error always free subvol_name in btrfs_mount btrfs: clean up pending block groups when transaction commit aborts btrfs: fix potential oops in device_list_add btrfs: don't end the transaction for delayed refs in throttle Btrfs: fix deadlock when allocating tree block during leaf/node split
2019-02-02Merge branch 'akpm' (patches from Andrew)Linus Torvalds6-4/+36
Merge misc fixes from Andrew Morton: "24 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (24 commits) autofs: fix error return in autofs_fill_super() autofs: drop dentry reference only when it is never used fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() mm: migrate: don't rely on __PageMovable() of newpage after unlocking it psi: clarify the Kconfig text for the default-disable option mm, memory_hotplug: __offline_pages fix wrong locking mm: hwpoison: use do_send_sig_info() instead of force_sig() kasan: mark file common so ftrace doesn't trace it init/Kconfig: fix grammar by moving a closing parenthesis lib/test_kmod.c: potential double free in error handling mm, oom: fix use-after-free in oom_kill_process mm/hotplug: invalid PFNs from pfn_to_online_page() mm,memory_hotplug: fix scan_movable_pages() for gigantic hugepages psi: fix aggregation idle shut-off mm, memory_hotplug: test_pages_in_a_zone do not pass the end of zone mm, memory_hotplug: is_mem_section_removable do not pass the end of a zone oom, oom_reaper: do not enqueue same task twice mm: migrate: make buffer_migrate_page_norefs() actually succeed kernel/exit.c: release ptraced tasks before zap_pid_ns_processes x86_64: increase stack size for KASAN_EXTRA ...
2019-02-01Merge tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds5-29/+61
Pull smb3 fixes from Steve French: "SMB3 fixes, some from this week's SMB3 test evemt, 5 for stable and a particularly important one for queryxattr (see xfstests 70 and 117)" * tag '5.0-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number CIFS: fix use-after-free of the lease keys CIFS: Do not consider -ENODATA as stat failure for reads CIFS: Do not count -ENODATA as failure for query directory CIFS: Fix trace command logging for SMB2 reads and writes CIFS: Fix possible oops and memory leaks in async IO cifs: limit amount of data we request for xattrs to CIFSMaxBufSize cifs: fix computation for MAX_SMB2_HDR_SIZE
2019-02-01autofs: fix error return in autofs_fill_super()Ian Kent1-1/+3
In autofs_fill_super() on error of get inode/make root dentry the return should be ENOMEM as this is the only failure case of the called functions. Link: http://lkml.kernel.org/r/154725123240.11260.796773942606871359.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01autofs: drop dentry reference only when it is never usedPan Bian1-1/+2
autofs_expire_run() calls dput(dentry) to drop the reference count of dentry. However, dentry is read via autofs_dentry_ino(dentry) after that. This may result in a use-free-bug. The patch drops the reference count of dentry only when it is never used. Link: http://lkml.kernel.org/r/154725122396.11260.16053424107144453867.stgit@pluto-themaw-net Signed-off-by: Pan Bian <bianpan2016@163.com> Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01fs/drop_caches.c: avoid softlockups in drop_pagecache_sb()Jan Kara1-1/+7
When superblock has lots of inodes without any pagecache (like is the case for /proc), drop_pagecache_sb() will iterate through all of them without dropping sb->s_inode_list_lock which can lead to softlockups (one of our customers hit this). Fix the problem by going to the slow path and doing cond_resched() in case the process needs rescheduling. Link: http://lkml.kernel.org/r/20190114085343.15011-1-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01proc: fix /proc/net/* after setns(2)Alexey Dobriyan3-1/+24
/proc entries under /proc/net/* can't be cached into dcache because setns(2) can change current net namespace. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: avoid vim miscolorization] [adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time] Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2 Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2 Fixes: 1da4d377f943fe4194ffb9fb9c26cc58fad4dd24 ("proc: revalidate misc dentries") Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com> Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-02-01Merge tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-7/+30
Pull iomap fixes from Darrick Wong: "A couple of iomap fixes to eliminate some memory corruption and hang problems that were reported: - fix page migration when using iomap for pagecache management - fix a use-after-free bug in the directio code" * tag 'iomap-5.0-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: fix a use after free in iomap_dio_rw iomap: get/put the page in iomap_page_create/release()
2019-01-31gfs2: Revert "Fix loop in gfs2_rbm_find"Andreas Gruenbacher1-1/+1
This reverts commit 2d29f6b96d8f80322ed2dd895bca590491c38d34. It turns out that the fix can lead to a ~20 percent performance regression in initial writes to the page cache according to iozone. Let's revert this for now to have more time for a proper fix. Cc: stable@vger.kernel.org # v3.13+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-31Merge tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds2-4/+10
Pull NFS client fixes from Anna Schumaker: "This addresses two bugs, one in the error code handling of nfs_page_async_flush() and one to fix a potential NULL pointer dereference in nfs_parse_devname(). Stable bugfix: - Fix up return value on fatal errors in nfs_page_async_flush() Other bugfix: - Fix NULL pointer dereference of dev_name" * tag 'nfs-for-5.0-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Fix up return value on fatal errors in nfs_page_async_flush() nfs: Fix NULL pointer dereference of dev_name
2019-01-31cifs: update internal module version numberSteve French1-1/+1
To 2.17 Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-31CIFS: fix use-after-free of the lease keysAurelien Aptel1-2/+2
The request buffers are freed right before copying the pointers. Use the func args instead which are identical and still valid. Simple reproducer (requires KASAN enabled) on a cifs mount: echo foo > foo ; tail -f foo & rm foo Cc: <stable@vger.kernel.org> # 4.20 Fixes: 179e44d49c2f ("smb3: add tracepoint for sending lease break responses to server") Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara <palcantara@suse.de>
2019-01-30fs/dcache: Track & report number of negative dentriesWaiman Long1-0/+32
The current dentry number tracking code doesn't distinguish between positive & negative dentries. It just reports the total number of dentries in the LRU lists. As excessive number of negative dentries can have an impact on system performance, it will be wise to track the number of positive and negative dentries separately. This patch adds tracking for the total number of negative dentries in the system LRU lists and reports it in the 5th field in the /proc/sys/fs/dentry-state file. The number, however, does not include negative dentries that are in flight but not in the LRU yet as well as those in the shrinker lists which are on the way out anyway. The number of positive dentries in the LRU lists can be roughly found by subtracting the number of negative dentries from the unused count. Matthew Wilcox had confirmed that since the introduction of the dentry_stat structure in 2.1.60, the dummy array was there, probably for future extension. They were not replacements of pre-existing fields. So no sane applications that read the value of /proc/sys/fs/dentry-state will do dummy thing if the last 2 fields of the sysctl parameter are not zero. IOW, it will be safe to use one of the dummy array entry for negative dentry count. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30fs/dcache: Fix incorrect nr_dentry_unused accounting in shrink_dcache_sb()Waiman Long1-5/+1
The nr_dentry_unused per-cpu counter tracks dentries in both the LRU lists and the shrink lists where the DCACHE_LRU_LIST bit is set. The shrink_dcache_sb() function moves dentries from the LRU list to a shrink list and subtracts the dentry count from nr_dentry_unused. This is incorrect as the nr_dentry_unused count will also be decremented in shrink_dentry_list() via d_shrink_del(). To fix this double decrement, the decrement in the shrink_dcache_sb() function is taken out. Fixes: 4e717f5c1083 ("list_lru: remove special case function list_lru_dispose_all." Cc: stable@kernel.org Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-30btrfs: On error always free subvol_name in btrfs_mountEric W. Biederman1-0/+3
The subvol_name is allocated in btrfs_parse_subvol_options and is consumed and freed in mount_subvol. Add a free to the error paths that don't call mount_subvol so that it is guaranteed that subvol_name is freed when an error happens. Fixes: 312c89fbca06 ("btrfs: cleanup btrfs_mount() using btrfs_mount_root()") Cc: stable@vger.kernel.org # v4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-30btrfs: clean up pending block groups when transaction commit abortsDavid Sterba1-0/+16
The fstests generic/475 stresses transaction aborts and can reveal space accounting or use-after-free bugs regarding block goups. In this case the pending block groups that remain linked to the structures after transaction commit aborts in the middle. The corrupted slabs lead to failures in following tests, eg. generic/476 [ 8172.752887] BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 [ 8172.755799] #PF error: [normal kernel read fault] [ 8172.757571] PGD 661ae067 P4D 661ae067 PUD 3db8e067 PMD 0 [ 8172.759000] Oops: 0000 [#1] PREEMPT SMP [ 8172.760209] CPU: 0 PID: 39 Comm: kswapd0 Tainted: G W 5.0.0-rc2-default #408 [ 8172.762495] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8172.765772] RIP: 0010:shrink_page_list+0x2f9/0xe90 [ 8172.770453] RSP: 0018:ffff967f00663b18 EFLAGS: 00010287 [ 8172.771184] RAX: 0000000000000000 RBX: ffff967f00663c20 RCX: 0000000000000000 [ 8172.772850] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8c0620ab20e0 [ 8172.774629] RBP: ffff967f00663dd8 R08: 0000000000000000 R09: 0000000000000000 [ 8172.776094] R10: ffff8c0620ab22f8 R11: ffff8c063f772688 R12: ffff967f00663b78 [ 8172.777533] R13: ffff8c063f625600 R14: ffff8c063f625608 R15: dead000000000200 [ 8172.778886] FS: 0000000000000000(0000) GS:ffff8c063d400000(0000) knlGS:0000000000000000 [ 8172.780545] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8172.781787] CR2: 0000000000000058 CR3: 000000004e962000 CR4: 00000000000006f0 [ 8172.783547] Call Trace: [ 8172.784112] shrink_inactive_list+0x194/0x410 [ 8172.784747] shrink_node_memcg.constprop.85+0x3a5/0x6a0 [ 8172.785472] shrink_node+0x62/0x1e0 [ 8172.786011] balance_pgdat+0x216/0x460 [ 8172.786577] kswapd+0xe3/0x4a0 [ 8172.787085] ? finish_wait+0x80/0x80 [ 8172.787795] ? balance_pgdat+0x460/0x460 [ 8172.788799] kthread+0x116/0x130 [ 8172.789640] ? kthread_create_on_node+0x60/0x60 [ 8172.790323] ret_from_fork+0x24/0x30 [ 8172.794253] CR2: 0000000000000058 or accounting errors at umount time: [ 8159.537251] WARNING: CPU: 2 PID: 19031 at fs/btrfs/extent-tree.c:5987 btrfs_free_block_groups+0x3d5/0x410 [btrfs] [ 8159.543325] CPU: 2 PID: 19031 Comm: umount Tainted: G W 5.0.0-rc2-default #408 [ 8159.545472] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014 [ 8159.548155] RIP: 0010:btrfs_free_block_groups+0x3d5/0x410 [btrfs] [ 8159.554030] RSP: 0018:ffff967f079cbde8 EFLAGS: 00010206 [ 8159.555144] RAX: 0000000001000000 RBX: ffff8c06366cf800 RCX: 0000000000000000 [ 8159.556730] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff8c06255ad800 [ 8159.558279] RBP: ffff8c0637ac0000 R08: 0000000000000001 R09: 0000000000000000 [ 8159.559797] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8c0637ac0108 [ 8159.561296] R13: ffff8c0637ac0158 R14: 0000000000000000 R15: dead000000000100 [ 8159.562852] FS: 00007f7f693b9fc0(0000) GS:ffff8c063d800000(0000) knlGS:0000000000000000 [ 8159.564839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8159.566160] CR2: 00007f7f68fab7b0 CR3: 000000000aec7000 CR4: 00000000000006e0 [ 8159.567898] Call Trace: [ 8159.568597] close_ctree+0x17f/0x350 [btrfs] [ 8159.569628] generic_shutdown_super+0x64/0x100 [ 8159.570808] kill_anon_super+0x14/0x30 [ 8159.571857] btrfs_kill_super+0x12/0xa0 [btrfs] [ 8159.573063] deactivate_locked_super+0x29/0x60 [ 8159.574234] cleanup_mnt+0x3b/0x70 [ 8159.575176] task_work_run+0x98/0xc0 [ 8159.576177] exit_to_usermode_loop+0x83/0x90 [ 8159.577315] do_syscall_64+0x15b/0x180 [ 8159.578339] entry_SYSCALL_64_after_hwframe+0x49/0xbe This fix is based on 2 Josef's patches that used sideefects of btrfs_create_pending_block_groups, this fix introduces the helper that does what we need. CC: stable@vger.kernel.org # 4.4+ CC: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-30btrfs: fix potential oops in device_list_addAl Viro1-2/+2
alloc_fs_devices() can return ERR_PTR(-ENOMEM), so dereferencing its result before the check for IS_ERR() is a bad idea. Fixes: d1a63002829a4 ("btrfs: add members to fs_devices to track fsid changes") Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-29CIFS: Do not consider -ENODATA as stat failure for readsPavel Shilovsky1-1/+1
When doing reads beyound the end of a file the server returns error STATUS_END_OF_FILE error which is mapped to -ENODATA. Currently we report it as a failure which confuses read stats. Change it to not consider -ENODATA as failure for stat purposes. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-29CIFS: Do not count -ENODATA as failure for query directoryPavel Shilovsky1-2/+2
Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-29CIFS: Fix trace command logging for SMB2 reads and writesPavel Shilovsky1-16/+30
Currently we log success once we send an async IO request to the server. Instead we need to analyse a response and then log success or failure for a particular command. Also fix argument list for read logging. Cc: <stable@vger.kernel.org> # 4.18 Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-29CIFS: Fix possible oops and memory leaks in async IOPavel Shilovsky1-3/+8
Allocation of a page array for non-cached IO was separated from allocation of rdata and wdata structures and this introduced memory leaks and a possible null pointer dereference. This patch fixes these problems. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-29cifs: limit amount of data we request for xattrs to CIFSMaxBufSizeRonnie Sahlberg2-3/+16
minus the various headers and blobs that will be part of the reply. or else we might trigger a session reconnect. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-01-29cifs: fix computation for MAX_SMB2_HDR_SIZERonnie Sahlberg1-2/+2
The size of the fixed part of the create response is 88 bytes not 56. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2019-01-29NFS: Fix up return value on fatal errors in nfs_page_async_flush()Trond Myklebust1-4/+5
Ensure that we return the fatal error value that caused us to exit nfs_page_async_flush(). Fixes: c373fff7bd25 ("NFSv4: Don't special case "launder"") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org # v4.12+ Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-28nfs: Fix NULL pointer dereference of dev_nameYao Liu1-0/+5
There is a NULL pointer dereference of dev_name in nfs_parse_devname() The oops looks something like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 ... RIP: 0010:nfs_fs_mount+0x3b6/0xc20 [nfs] ... Call Trace: ? ida_alloc_range+0x34b/0x3d0 ? nfs_clone_super+0x80/0x80 [nfs] ? nfs_free_parsed_mount_data+0x60/0x60 [nfs] mount_fs+0x52/0x170 ? __init_waitqueue_head+0x3b/0x50 vfs_kern_mount+0x6b/0x170 do_mount+0x216/0xdc0 ksys_mount+0x83/0xd0 __x64_sys_mount+0x25/0x30 do_syscall_64+0x65/0x220 entry_SYSCALL_64_after_hwframe+0x49/0xbe Fix this by adding a NULL check on dev_name Signed-off-by: Yao Liu <yotta.liu@ucloud.cn> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2019-01-28btrfs: don't end the transaction for delayed refs in throttleJosef Bacik1-8/+0
Previously callers to btrfs_end_transaction_throttle() would commit the transaction if there wasn't enough delayed refs space. This happens in relocation, and if the fs is relatively empty we'll run out of delayed refs space basically immediately, so we'll just be stuck in this loop of committing the transaction over and over again. This code existed because we didn't have a good feedback mechanism for running delayed refs, but with the delayed refs rsv we do now. Delete this throttling code and let the btrfs_start_transaction() in relocation deal with putting pressure on the delayed refs infrastructure. With this patch we no longer take 5 minutes to balance a metadata only fs. Qu has submitted a fstest to catch slow balance or excessive transaction commits. Steps to reproduce: * create subvolume * create many (eg. 16000) inlined files, of size 2KiB * iteratively snapshot and touch several files to trigger metadata updates * start balance -m Reported-by: Qu Wenruo <wqu@suse.com> Fixes: 64403612b73a ("btrfs: rework btrfs_check_space_for_delayed_refs") Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add tags and steps to reproduce ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-28Btrfs: fix deadlock when allocating tree block during leaf/node splitFilipe Manana1-28/+50
When splitting a leaf or node from one of the trees that are modified when flushing pending block groups (extent, chunk, device and free space trees), we need to allocate a new tree block, which in turn can result in the need to allocate a new block group. After allocating the new block group we may need to flush new block groups that were previously allocated during the course of the current transaction, which is what may cause a deadlock due to attempts to write lock twice the same leaf or node, as when splitting a leaf or node we are holding a write lock on it and its parent node. The same type of deadlock can also happen when increasing the tree's height, since we are holding a lock on the existing root while allocating the tree block to use as the new root node. An example trace when the deadlock happens during the leaf split path is: [27175.293054] CPU: 0 PID: 3005 Comm: kworker/u17:6 Tainted: G W 4.19.16 #1 [27175.293942] Hardware name: Penguin Computing Relion 1900/MD90-FS0-ZB-XX, BIOS R15 06/25/2018 [27175.294846] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] (...) [27175.298384] RSP: 0018:ffffab2087107758 EFLAGS: 00010246 [27175.299269] RAX: 0000000000000bbd RBX: ffff9fadc7141c48 RCX: 0000000000000001 [27175.300155] RDX: 0000000000000001 RSI: 0000000000000002 RDI: ffff9fadc7141c48 [27175.301023] RBP: 0000000000000001 R08: ffff9faeb6ac1040 R09: ffff9fa9c0000000 [27175.301887] R10: 0000000000000000 R11: 0000000000000040 R12: ffff9fb21aac8000 [27175.302743] R13: ffff9fb1a64d6a20 R14: 0000000000000001 R15: ffff9fb1a64d6a18 [27175.303601] FS: 0000000000000000(0000) GS:ffff9fb21fa00000(0000) knlGS:0000000000000000 [27175.304468] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27175.305339] CR2: 00007fdc8743ead8 CR3: 0000000763e0a006 CR4: 00000000003606f0 [27175.306220] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [27175.307087] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [27175.307940] Call Trace: [27175.308802] btrfs_search_slot+0x779/0x9a0 [btrfs] [27175.309669] ? update_space_info+0xba/0xe0 [btrfs] [27175.310534] btrfs_insert_empty_items+0x67/0xc0 [btrfs] [27175.311397] btrfs_insert_item+0x60/0xd0 [btrfs] [27175.312253] btrfs_create_pending_block_groups+0xee/0x210 [btrfs] [27175.313116] do_chunk_alloc+0x25f/0x300 [btrfs] [27175.313984] find_free_extent+0x706/0x10d0 [btrfs] [27175.314855] btrfs_reserve_extent+0x9b/0x1d0 [btrfs] [27175.315707] btrfs_alloc_tree_block+0x100/0x5b0 [btrfs] [27175.316548] split_leaf+0x130/0x610 [btrfs] [27175.317390] btrfs_search_slot+0x94d/0x9a0 [btrfs] [27175.318235] btrfs_insert_empty_items+0x67/0xc0 [btrfs] [27175.319087] alloc_reserved_file_extent+0x84/0x2c0 [btrfs] [27175.319938] __btrfs_run_delayed_refs+0x596/0x1150 [btrfs] [27175.320792] btrfs_run_delayed_refs+0xed/0x1b0 [btrfs] [27175.321643] delayed_ref_async_start+0x81/0x90 [btrfs] [27175.322491] normal_work_helper+0xd0/0x320 [btrfs] [27175.323328] ? move_linked_works+0x6e/0xa0 [27175.324160] process_one_work+0x191/0x370 [27175.324976] worker_thread+0x4f/0x3b0 [27175.325763] kthread+0xf8/0x130 [27175.326531] ? rescuer_thread+0x320/0x320 [27175.327284] ? kthread_create_worker_on_cpu+0x50/0x50 [27175.328027] ret_from_fork+0x35/0x40 [27175.328741] ---[ end trace 300a1b9f0ac30e26 ]--- Fix this by preventing the flushing of new blocks groups when splitting a leaf/node and when inserting a new root node for one of the trees modified by the flushing operation, similar to what is done when COWing a node/leaf from on of these trees. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202383 Reported-by: Eli V <eliventer@gmail.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-27iomap: fix a use after free in iomap_dio_rwChristoph Hellwig1-7/+21
Introduce a local wait_for_completion variable to avoid an access to the potentially freed dio struture after dropping the last reference count. Also use the chance to document the completion behavior to make the refcounting clear to the reader of the code. Fixes: ff6a9292e6 ("iomap: implement direct I/O") Reported-by: Chandan Rajendra <chandan@linux.ibm.com> Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Chandan Rajendra <chandan@linux.ibm.com> Tested-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-01-27iomap: get/put the page in iomap_page_create/release()Piotr Jaroszynski1-0/+9
migrate_page_move_mapping() expects pages with private data set to have a page_count elevated by 1. This is what used to happen for xfs through the buffer_heads code before the switch to iomap in commit 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads"). Not having the count elevated causes move_pages() to fail on memory mapped files coming from xfs. Make iomap compatible with the migrate_page_move_mapping() assumption by elevating the page count as part of iomap_page_create() and lowering it in iomap_page_release(). It causes the move_pages() syscall to misbehave on memory mapped files from xfs. It does not not move any pages, which I suppose is "just" a perf issue, but it also ends up returning a positive number which is out of spec for the syscall. Talking to Michal Hocko, it sounds like returning positive numbers might be a necessary update to move_pages() anyway though. Fixes: 82cb14175e7d ("xfs: add support for sub-pagesize writeback without buffer_heads") Signed-off-by: Piotr Jaroszynski <pjaroszynski@nvidia.com> [hch: actually get/put the page iomap_migrate_page() to make it work properly] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-01-26Merge tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds10-67/+136
Pull smb3 fixes from Steve French: "A set of small smb3 fixes, some fixing various crediting issues discovered during xfstest runs, five for stable" * tag '5.0-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugData smb3: add credits we receive from oplock/break PDUs CIFS: Fix mounts if the client is low on credits CIFS: Do not assume one credit for async responses CIFS: Fix credit calculations in compound mid callback CIFS: Fix credit calculation for encrypted reads with errors CIFS: Fix credits calculations for reads with errors CIFS: Do not reconnect TCP session in add_credits() smb3: Cleanup license mess CIFS: Fix possible hang during async MTU reads and writes cifs: fix memory leak of an allocated cifs_ntsd structure
2019-01-26Merge tag 'for-linus-20190125' of git://git.kernel.dk/linux-blockLinus Torvalds2-4/+41
Pull block fixes from Jens Axboe: "A collection of fixes for this release. This contains: - Silence sparse rightfully complaining about non-static wbt functions (Bart) - Fixes for the zoned comments/ioctl documentation (Damien) - direct-io fix that's been lingering for a while (Ernesto) - cgroup writeback fix (Tejun) - Set of NVMe patches for nvme-rdma/tcp (Sagi, Hannes, Raju) - Block recursion tracking fix (Ming) - Fix debugfs command flag naming for a few flags (Jianchao)" * tag 'for-linus-20190125' of git://git.kernel.dk/linux-block: block: Fix comment typo uapi: fix ioctl documentation blk-wbt: Declare local functions static blk-mq: fix the cmd_flag_name array nvme-multipath: drop optimization for static ANA group IDs nvmet-rdma: fix null dereference under heavy load nvme-rdma: rework queue maps handling nvme-tcp: fix timeout handler nvme-rdma: fix timeout handler writeback: synchronize sync(2) against cgroup writeback membership switches block: cover another queue enter recursion via BIO_QUEUE_ENTERED direct-io: allow direct writes to empty inodes
2019-01-24cifs: print CIFSMaxBufSize as part of /proc/fs/cifs/DebugDataRonnie Sahlberg1-0/+1
Was helpful in debug for some recent problems. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24smb3: add credits we receive from oplock/break PDUsRonnie Sahlberg1-0/+7
Otherwise we gradually leak credits leading to potential hung session. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix mounts if the client is low on creditsPavel Shilovsky1-0/+17
If the server doesn't grant us at least 3 credits during the mount we won't be able to complete it because query path info operation requires 3 credits. Use the cached file handle if possible to allow the mount to succeed. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Do not assume one credit for async responsesPavel Shilovsky1-4/+11
If we don't receive a response we can't assume that the server granted one credit. Assume zero credits in such cases. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix credit calculations in compound mid callbackPavel Shilovsky2-11/+6
The current code doesn't do proper accounting for credits in SMB1 case: it adds one credit per response only if we get a complete response while it needs to return it unconditionally. Fix this and also include malformed responses for SMB2+ into accounting for credits because such responses have Credit Granted field, thus nothing prevents to get a proper credit value from them. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix credit calculation for encrypted reads with errorsPavel Shilovsky1-10/+14
We do need to account for credits received in error responses to read requests on encrypted sessions. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix credits calculations for reads with errorsPavel Shilovsky1-12/+23
Currently we mark MID as malformed if we get an error from server in a read response. This leads to not properly processing credits in the readv callback. Fix this by marking such a response as normal received response and process it appropriately. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Do not reconnect TCP session in add_credits()Pavel Shilovsky2-7/+46
When executing add_credits() we currently call cifs_reconnect() if the number of credits is zero and there are no requests in flight. In this case we may call cifs_reconnect() recursively twice and cause memory corruption given the following sequence of functions: mid1.callback() -> add_credits() -> cifs_reconnect() -> -> mid2.callback() -> add_credits() -> cifs_reconnect(). Fix this by avoiding to call cifs_reconnect() in add_credits() and checking for zero credits in the demultiplex thread. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-25Merge tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fsLinus Torvalds1-2/+4
Pull inotify fix from Jan Kara: "Fix a file refcount leak in an inotify error path" * tag 'fsnotify_for_v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: inotify: Fix fd refcount leak in inotify_add_watch().
2019-01-24smb3: Cleanup license messThomas Gleixner2-20/+0
Precise and non-ambiguous license information is important. The recently added aegis header file has a SPDX license identifier, which is nice, but at the same time it has a contradictionary license boiler plate text. SPDX-License-Identifier: GPL-2.0 versus * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or * (at your option) any later version. Oh well. Assuming that the SPDX identifier is correct and according to x86/hyper-v contributions from Microsoft GPL V2 only is the usual license. Remove the boiler plate as it is wrong and even if correct it is redundant. Fixes: eccb4422cf97 ("smb3: Add ftrace tracepoints for improved SMB3 debugging") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steve French <sfrench@samba.org> Cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24CIFS: Fix possible hang during async MTU reads and writesPavel Shilovsky1-3/+3
When doing MTU i/o we need to leave some credits for possible reopen requests and other operations happening in parallel. Currently we leave 1 credit which is not enough even for reopen only: we need at least 2 credits if durable handle reconnect fails. Also there may be other operations at the same time including compounding ones which require 3 credits at a time each. Fix this by leaving 8 credits which is big enough to cover most scenarios. Was able to reproduce this when server was configured to give out fewer credits than usual. The proper fix would be to reconnect a file handle first and then obtain credits for an MTU request but this leads to bigger code changes and should happen in other patches. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-24cifs: fix memory leak of an allocated cifs_ntsd structureColin Ian King1-0/+8
The call to SMB2_queary_acl can allocate memory to pntsd and also return a failure via a call to SMB2_query_acl (and then query_info). This occurs when query_info allocates the structure and then in query_info the call to smb2_validate_and_copy_iov fails. Currently the failure just returns without kfree'ing pntsd hence causing a memory leak. Currently, *data is allocated if it's not already pointing to a buffer, so it needs to be kfree'd only if was allocated in query_info, so the fix adds an allocated flag to track this. Also set *dlen to zero on an error just to be safe since *data is kfree'd. Also set errno to -ENOMEM if the allocation of *data fails. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Dan Carpener <dan.carpenter@oracle.com>
2019-01-22writeback: synchronize sync(2) against cgroup writeback membership switchesTejun Heo1-2/+38
sync_inodes_sb() can race against cgwb (cgroup writeback) membership switches and fail to writeback some inodes. For example, if an inode switches to another wb while sync_inodes_sb() is in progress, the new wb might not be visible to bdi_split_work_to_wbs() at all or the inode might jump from a wb which hasn't issued writebacks yet to one which already has. This patch adds backing_dev_info->wb_switch_rwsem to synchronize cgwb switch path against sync_inodes_sb() so that sync_inodes_sb() is guaranteed to see all the target wbs and inodes can't jump wbs to escape syncing. v2: Fixed misplaced rwsem init. Spotted by Jiufei. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jiufei Xue <xuejiufei@gmail.com> Link: http://lkml.kernel.org/r/dc694ae2-f07f-61e1-7097-7c8411cee12d@gmail.com Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-22direct-io: allow direct writes to empty inodesErnesto A. Fernández1-2/+3
On a DIO_SKIP_HOLES filesystem, the ->get_block() method is currently not allowed to create blocks for an empty inode. This confusion comes from trying to bit shift a negative number, so check the size of the inode first. The problem is most visible for hfsplus, because the fallback to buffered I/O doesn't happen and the write fails with EIO. This is in part the fault of the module, because it gives a wrong return value on ->get_block(); that will be fixed in a separate patch. Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-01-21ceph: quota: cleanup license messThomas Gleixner1-13/+0
Precise and non-ambiguous license information is important. The recently added quota.c file has a SPDX license identifier, which is nice, but at the same time it has a contradictionary license boiler plate text. SPDX-License-Identifier: GPL-2.0 versus * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License * as published by the Free Software Foundation; either version 2 * of the License, or (at your option) any later version. Oh well. As the other ceph related files are licensed under the GPL v2 only, it's assumed that the SPDX id is correct and the boiler plate was randomly copied into that patch. Remove the boiler plate as it is wrong and even if correct it is redundant. Fixes: fb18a57568c2 ("ceph: quota: add initial infrastructure to support cephfs quotas") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Luis Henriques <lhenriques@suse.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "Yan, Zheng" <zyan@redhat.com> Cc: Sage Weil <sage@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: ceph-devel@vger.kernel.org Acked-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-21ceph: clear inode pointer when snap realm gets dropped by its inodeYan, Zheng1-0/+2
snap realm and corresponding inode have pointers to each other. The two pointer should get clear at the same time. Otherwise, snap realm's pointer may reference freed inode. Cc: stable@vger.kernel.org # 4.17+ Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-21Merge tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linuxLinus Torvalds1-8/+4
Pull pstore fixes from Kees Cook: - Fix console ramoops to show the previous boot logs (Sai Prakash Ranjan) - Avoid allocation and leak of platform data * tag 'pstore-v5.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore/ram: Avoid allocation and leak of platform data pstore/ram: Fix console ramoops to show the previous boot logs
2019-01-20pstore/ram: Avoid allocation and leak of platform dataKees Cook1-6/+3
Yue Hu noticed that when parsing device tree the allocated platform data was never freed. Since it's not used beyond the function scope, this switches to using a stack variable instead. Reported-by: Yue Hu <huyue2@yulong.com> Fixes: 35da60941e44 ("pstore/ram: add Device Tree bindings") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>