aboutsummaryrefslogtreecommitdiffstats
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-01-11CIFS: Fix credits calculation for cancelled requestsPavel Shilovsky2-2/+27
If a request is cancelled, we can't assume that the server returns 1 credit back. Instead we need to wait for a response and process the number of credits granted by the server. Create a separate mid callback for cancelled request, parse the number of credits in a response buffer and add them to the client's credits. If the didn't get a response (no response buffer available) assume 0 credits granted. The latter most probably happens together with session reconnect, so the client's credits are adjusted anyway. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11cifs: Fix potential OOB access of lock element arrayRoss Lagerwall2-6/+6
If maxBuf is small but non-zero, it could result in a zero sized lock element array which we would then try and access OOB. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-11cifs: Limit memory used by lock request calls to a pageRoss Lagerwall2-0/+12
The code tries to allocate a contiguous buffer with a size supplied by the server (maxBuf). This could fail if memory is fragmented since it results in high order allocations for commonly used server implementations. It is also wasteful since there are probably few locks in the usual case. Limit the buffer to be no larger than a page to avoid memory allocation failures due to fragmentation. Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11cifs: move large array from stack to heapAurelien Aptel2-14/+32
This addresses some compile warnings that you can see depending on configuration settings. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-11CIFS: Do not hide EINTR after sending network packetsPavel Shilovsky1-1/+1
Currently we hide EINTR code returned from sock_sendmsg() and return 0 instead. This makes a caller think that we successfully completed the network operation which is not true. Fix this by properly returning EINTR to callers. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-10CIFS: Fix credit computation for compounded requestsPavel Shilovsky1-18/+41
In SMB3 protocol every part of the compound chain consumes credits individually, so we need to call wait_for_free_credits() for each of the PDUs in the chain. If an operation is interrupted, we must ensure we return all credits taken from the server structure back. Without this patch server can sometimes disconnect the session due to credit mismatches, especially when first operation(s) are large writes. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-10CIFS: Do not set credits to 1 if the server didn't grant anythingPavel Shilovsky1-2/+0
Currently we reset the number of total credits granted by the server to 1 if the server didn't grant us anything int the response. This violates the SMB3 protocol - we need to trust the server and use the credit values from the response. Fix this by removing the corresponding code. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-10CIFS: Fix adjustment of credits for MTU requestsPavel Shilovsky1-2/+6
Currently for MTU requests we allocate maximum possible credits in advance and then adjust them according to the request size. While we were adjusting the number of credits belonging to the server, we were skipping adjustment of credits belonging to the request. This patch fixes it by setting request credits to CreditCharge field value of SMB2 packet header. Also ask 1 credit more for async read and write operations to increase parallelism and match the behavior of other operations. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org>
2019-01-10cifs: Fix a tiny potential memory leakDan Carpenter1-0/+1
The most recent "it" allocation is leaked on this error path. I believe that small allocations always succeed in current kernels so this doesn't really affect run time. Fixes: 54be1f6c1c37 ("cifs: Add DFS cache routines") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2019-01-10cifs: Fix a debug messageDan Carpenter1-3/+4
This debug message was never shown because it was checking for NULL returns but extract_hostname() returns error pointers. Fixes: 93d5cb517db3 ("cifs: Add support for failover in cifs_reconnect()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Paulo Alcantara <palcantara@suse.de>
2019-01-10afs: Set correct lock type for the yfs CreateFileMarc Dionne2-1/+12
A lock type of 0 is "LockRead", which makes the fileserver record an unintentional read lock on the new file. This will cause problems later on if the file is the subject of locking operations. The correct default value should be -1 ("LockNone"). Fix the operation marshalling code to set the value and provide an enum to symbolise the values whilst we're at it. Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-10afs: Use struct_size() in kzalloc()Gustavo A. R. Silva1-3/+1
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-10btrfs: Use real device structure to verify dev extentQu Wenruo1-0/+12
[BUG] Linux v5.0-rc1 will fail fstests/btrfs/163 with the following kernel message: BTRFS error (device dm-6): dev extent devid 1 physical offset 13631488 len 8388608 is beyond device boundary 0 BTRFS error (device dm-6): failed to verify dev extents against chunks: -117 BTRFS error (device dm-6): open_ctree failed [CAUSE] Commit cf90d884b347 ("btrfs: Introduce mount time chunk <-> dev extent mapping check") introduced strict check on dev extents. We use btrfs_find_device() with dev uuid and fs uuid set to NULL, and only dependent on @devid to find the real device. For seed devices, we call clone_fs_devices() in open_seed_devices() to allow us search seed devices directly. However clone_fs_devices() just populates devices with devid and dev uuid, without populating other essential members, like disk_total_bytes. This makes any device returned by btrfs_find_device(fs_info, devid, NULL, NULL) is just a dummy, with 0 disk_total_bytes, and any dev extents on the seed device will not pass the device boundary check. [FIX] This patch will try to verify the device returned by btrfs_find_device() and if it's a dummy then re-search in seed devices. Fixes: cf90d884b347 ("btrfs: Introduce mount time chunk <-> dev extent mapping check") CC: stable@vger.kernel.org # 4.19+ Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-09Btrfs: fix deadlock when using free space tree due to block group creationFilipe Manana1-7/+9
When modifying the free space tree we can end up COWing one of its extent buffers which in turn might result in allocating a new chunk, which in turn can result in flushing (finish creation) of pending block groups. If that happens we can deadlock because creating a pending block group needs to update the free space tree, and if any of the updates tries to modify the same extent buffer that we are COWing, we end up in a deadlock since we try to write lock twice the same extent buffer. So fix this by skipping pending block group creation if we are COWing an extent buffer from the free space tree. This is a case missed by commit 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches"). Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202173 Fixes: 5ce555578e091 ("Btrfs: fix deadlock when writing out free space caches") CC: stable@vger.kernel.org # 4.18+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-09Btrfs: fix race between reflink/dedupe and relocationFilipe Manana1-6/+28
The recent rework that makes btrfs' remap_file_range operation use the generic helper generic_remap_file_range_prep() introduced a race between relocation and reflinking (for both cloning and deduplication) the file extents between the source and destination inodes. This happens because we no longer lock the source range anymore, and we do not lock it anymore because we wait for direct IO writes and writeback to complete early on the code path right after locking the inodes, which guarantees no other file operations interfere with the reflinking. However there is one exception which is relocation, since it replaces the byte number of file extents items in the fs tree after locking the range the file extent items represent. This is a problem because after finding each file extent to clone in the fs tree, the reflink process copies the file extent item into a local buffer, releases the search path, inserts new file extent items in the destination range and then increments the reference count for the extent mentioned in the file extent item that it previously copied to the buffer. If right after copying the file extent item into the buffer and releasing the path the relocation process updates the file extent item to point to the new extent, the reflink process ends up creating a delayed reference to increment the reference count of the old extent, for which the relocation process already created a delayed reference to drop it. This results in failure to run delayed references because we will attempt to increment the count of a reference that was already dropped. This is illustrated by the following diagram: CPU 1 CPU 2 relocation is running btrfs_clone_files() btrfs_clone() --> finds extent item in source range point to extent at bytenr X --> copies it into a local buffer --> releases path replace_file_extents() --> successfully locks the range represented by the file extent item --> replaces disk_bytenr field in the file extent item with some other value Y --> creates delayed reference to increment reference count for extent at bytenr Y --> creates delayed reference to drop the extent at bytenr X --> starts transaction --> creates delayed reference to increment extent at bytenr X <delayed references are run, due to a transaction commit for example, and the transaction is aborted with -EIO because we attempt to increment reference count for the extent at bytenr X after we freed it> When this race is hit the running transaction ends up getting aborted with an -EIO error and a trace like the following is produced: [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1 [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202 [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001 [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000 [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548 [ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000 [ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0 [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4382.564887] Call Trace: [ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs] [ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs] [ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs] [ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs] [ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs] [ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs] [ 4382.568112] ? _raw_spin_unlock+0x24/0x30 [ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs] [ 4382.569006] create_subvol+0x3c8/0x830 [btrfs] [ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60 [ 4382.570822] ? __sb_start_write+0xd4/0x1c0 [ 4382.571262] ? mnt_want_write_file+0x24/0x50 [ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs] [ 4382.572155] ? _copy_from_user+0x66/0x90 [ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs] [ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs] [ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570 [ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0 [ 4382.574379] ? _raw_spin_unlock+0x24/0x30 [ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0 [ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0 [ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 4382.576020] do_vfs_ioctl+0xa2/0x6f0 [ 4382.576405] ksys_ioctl+0x70/0x80 [ 4382.576776] __x64_sys_ioctl+0x16/0x20 [ 4382.577137] do_syscall_64+0x60/0x1b0 [ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7 [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003 [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044 [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010 [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050 [ 4382.580792] irq event stamp: 0 [ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null) [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null) [ 4382.582413] ---[ end trace d3c188e3e9367382 ]--- [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure [ 4382.624295] BTRFS info (device sdc): forced readonly Fix this by locking the source range before searching for the file extent items in the fs tree, since the relocation process will try to lock the range a file extent item represents before updating it with the new extent location. Fixes: 34a28e3d7753 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-09Btrfs: fix race between cloning range ending at eof and writebackFilipe Manana1-0/+15
The recent rework that makes btrfs' remap_file_range operation use the generic helper generic_remap_file_range_prep() introduced a race between writeback and cloning a range that covers the eof extent of the source file into a destination offset that is greater then the same file's size. This happens because we now wait for writeback to complete before doing the truncation of the eof block, while previously we did the truncation and then waited for writeback to complete. This leads to a race between writeback of the truncated block and cloning the file extents in the source range, because we copy each file extent item we find in the fs root into a buffer, then release the path and then increment the reference count for the extent referred in that file extent item we copied, which can no longer exist if writeback of the truncated eof block completes after we copied the file extent item into the buffer and before we incremented the reference count. This is illustrated by the following diagram: CPU 1 CPU 2 btrfs_clone_files() btrfs_cont_expand() btrfs_truncate_block() --> zeroes part of the page containg eof, marking it for delalloc btrfs_clone() --> finds extent item covering eof, points to extent at bytenr X --> copies it into a local buffer --> releases path writeback starts btrfs_finish_ordered_io() insert_reserved_file_extent() __btrfs_drop_extents() --> creates delayed reference to drop the extent at bytenr X --> starts transaction --> creates delayed reference to increment extent at bytenr X <delayed references are run, due to a transaction commit for example, and the transaction is aborted with -EIO because we attempt to increment reference count for the extent at bytenr X after we freed it> When this race is hit the running transaction ends up getting aborted with an -EIO error and a trace like the following is produced: [ 4382.553858] WARNING: CPU: 2 PID: 3648 at fs/btrfs/extent-tree.c:1552 lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556293] CPU: 2 PID: 3648 Comm: btrfs Tainted: G W 4.20.0-rc6-btrfs-next-41 #1 [ 4382.556294] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [ 4382.556308] RIP: 0010:lookup_inline_extent_backref+0x4f4/0x650 [btrfs] (...) [ 4382.556310] RSP: 0018:ffffac784408f738 EFLAGS: 00010202 [ 4382.556311] RAX: 0000000000000001 RBX: ffff8980673c3a48 RCX: 0000000000000001 [ 4382.556312] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 4382.556312] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 4382.556313] R10: 0000000000000001 R11: ffff897f40000000 R12: 0000000000001000 [ 4382.556313] R13: 00000000c224f000 R14: ffff89805de9bd40 R15: ffff8980453f4548 [ 4382.556315] FS: 00007f5e759178c0(0000) GS:ffff89807b300000(0000) knlGS:0000000000000000 [ 4382.563130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4382.563562] CR2: 00007f2e9789fcbc CR3: 0000000120512001 CR4: 00000000003606e0 [ 4382.564005] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4382.564451] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4382.564887] Call Trace: [ 4382.565343] insert_inline_extent_backref+0x55/0xe0 [btrfs] [ 4382.565796] __btrfs_inc_extent_ref.isra.60+0x88/0x260 [btrfs] [ 4382.566249] ? __btrfs_run_delayed_refs+0x93/0x1650 [btrfs] [ 4382.566702] __btrfs_run_delayed_refs+0xa22/0x1650 [btrfs] [ 4382.567162] btrfs_run_delayed_refs+0x7e/0x1d0 [btrfs] [ 4382.567623] btrfs_commit_transaction+0x50/0x9c0 [btrfs] [ 4382.568112] ? _raw_spin_unlock+0x24/0x30 [ 4382.568557] ? block_rsv_release_bytes+0x14e/0x410 [btrfs] [ 4382.569006] create_subvol+0x3c8/0x830 [btrfs] [ 4382.569461] ? btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.569906] btrfs_mksubvol+0x317/0x600 [btrfs] [ 4382.570383] ? rcu_sync_lockdep_assert+0xe/0x60 [ 4382.570822] ? __sb_start_write+0xd4/0x1c0 [ 4382.571262] ? mnt_want_write_file+0x24/0x50 [ 4382.571712] btrfs_ioctl_snap_create_transid+0x117/0x1a0 [btrfs] [ 4382.572155] ? _copy_from_user+0x66/0x90 [ 4382.572602] btrfs_ioctl_snap_create+0x66/0x80 [btrfs] [ 4382.573052] btrfs_ioctl+0x7c1/0x30e0 [btrfs] [ 4382.573502] ? mem_cgroup_commit_charge+0x8b/0x570 [ 4382.573946] ? do_raw_spin_unlock+0x49/0xc0 [ 4382.574379] ? _raw_spin_unlock+0x24/0x30 [ 4382.574803] ? __handle_mm_fault+0xf29/0x12d0 [ 4382.575215] ? do_vfs_ioctl+0xa2/0x6f0 [ 4382.575622] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [ 4382.576020] do_vfs_ioctl+0xa2/0x6f0 [ 4382.576405] ksys_ioctl+0x70/0x80 [ 4382.576776] __x64_sys_ioctl+0x16/0x20 [ 4382.577137] do_syscall_64+0x60/0x1b0 [ 4382.577488] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [ 4382.578837] RSP: 002b:00007ffe04bf64c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [ 4382.579174] RAX: ffffffffffffffda RBX: 00005564136f3050 RCX: 00007f5e74724dd7 [ 4382.579505] RDX: 00007ffe04bf64d0 RSI: 000000005000940e RDI: 0000000000000003 [ 4382.579848] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000044 [ 4382.580164] R10: 0000000000000541 R11: 0000000000000202 R12: 00005564136f3010 [ 4382.580477] R13: 0000000000000003 R14: 00005564136f3035 R15: 00005564136f3050 [ 4382.580792] irq event stamp: 0 [ 4382.581106] hardirqs last enabled at (0): [<0000000000000000>] (null) [ 4382.581441] hardirqs last disabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.581772] softirqs last enabled at (0): [<ffffffff8d085842>] copy_process.part.32+0x6e2/0x2320 [ 4382.582095] softirqs last disabled at (0): [<0000000000000000>] (null) [ 4382.582413] ---[ end trace d3c188e3e9367382 ]--- [ 4382.623855] BTRFS: error (device sdc) in btrfs_run_delayed_refs:2981: errno=-5 IO failure [ 4382.624295] BTRFS info (device sdc): forced readonly Fix this by waiting for writeback to complete after truncating the eof block. Fixes: 34a28e3d7753 ("Btrfs: use generic_remap_file_range_prep() for cloning and deduplication") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-01-08hugetlbfs: revert "Use i_mmap_rwsem to fix page fault/truncate race"Mike Kravetz1-28/+33
This reverts c86aa7bbfd5568ba8a82d3635d8f7b8a8e06fe54 The reverted commit caused ABBA deadlocks when file migration raced with file eviction for specific hugetlbfs files. This was discovered with a modified version of the LTP move_pages12 test. The purpose of the reverted patch was to close a long existing race between hugetlbfs file truncation and page faults. After more analysis of the patch and impacted code, it was determined that i_mmap_rwsem can not be used for all required synchronization. Therefore, revert this patch while working an another approach to the underlying issue. Link: http://lkml.kernel.org/r/20190103235452.29335-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Jan Stancek <jstancek@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prakash Sangappa <prakash.sangappa@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-07ceph: use vmf_error() in ceph_filemap_fault()Souptick Joarder1-4/+1
This code is converted to use vmf_error(). Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-07libceph: allow setting abort_on_full for rbdDongsheng Yang1-2/+2
Introduce a new option abort_on_full, default to false. Then we can get -ENOSPC when the pool is full, or reaches quota. [ Don't show abort_on_full in /proc/mounts. ] Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2019-01-07sysfs: convert BUG_ON to WARN_ONGreg Kroah-Hartman4-5/+10
It's rude to crash the system just because the developer did something wrong, as it prevents them from usually even seeing what went wrong. So convert the few BUG_ON() calls that have snuck into the sysfs code over the years to WARN_ON() to make it more "friendly". All of these are able to be recovered from, so it makes no sense to crash. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-06Merge tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscryptLinus Torvalds5-110/+363
Pull fscrypt updates from Ted Ts'o: "Add Adiantum support for fscrypt" * tag 'fscrypt_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/fscrypt: fscrypt: add Adiantum support
2019-01-06Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds4-10/+19
Pull ext4 bug fixes from Ted Ts'o: "Fix a number of ext4 bugs" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix special inode number checks in __ext4_iget() ext4: track writeback errors using the generic tracking infrastructure ext4: use ext4_write_inode() when fsyncing w/o a journal ext4: avoid kernel warning when writing the superblock to a dead device ext4: fix a potential fiemap/page fault deadlock w/ inline_data ext4: make sure enough credits are reserved for dioread_nolock writes
2019-01-06fscrypt: add Adiantum supportEric Biggers5-110/+363
Add support for the Adiantum encryption mode to fscrypt. Adiantum is a tweakable, length-preserving encryption mode with security provably reducible to that of XChaCha12 and AES-256, subject to a security bound. It's also a true wide-block mode, unlike XTS. See the paper "Adiantum: length-preserving encryption for entry-level processors" (https://eprint.iacr.org/2018/720.pdf) for more details. Also see commit 059c2a4d8e16 ("crypto: adiantum - add Adiantum support"). On sufficiently long messages, Adiantum's bottlenecks are XChaCha12 and the NH hash function. These algorithms are fast even on processors without dedicated crypto instructions. Adiantum makes it feasible to enable storage encryption on low-end mobile devices that lack AES instructions; currently such devices are unencrypted. On ARM Cortex-A7, on 4096-byte messages Adiantum encryption is about 4 times faster than AES-256-XTS encryption; decryption is about 5 times faster. In fscrypt, Adiantum is suitable for encrypting both file contents and names. With filenames, it fixes a known weakness: when two filenames in a directory share a common prefix of >= 16 bytes, with CTS-CBC their encrypted filenames share a common prefix too, leaking information. Adiantum does not have this problem. Since Adiantum also accepts long tweaks (IVs), it's also safe to use the master key directly for Adiantum encryption rather than deriving per-file keys, provided that the per-file nonce is included in the IVs and the master key isn't used for any other encryption mode. This configuration saves memory and improves performance. A new fscrypt policy flag is added to allow users to opt-in to this configuration. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-01-05Merge tag '4.21-smb3-small-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds4-16/+32
Pull smb3 fixes from Steve French: "Three fixes, one for stable, one adds the (most secure) SMB3.1.1 dialect to default list requested" * tag '4.21-smb3-small-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb3: add smb3.1.1 to default dialect list cifs: fix confusing warning message on reconnect smb3: fix large reads on encrypted connections
2019-01-05Merge tag 'xfs-4.21-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2-2/+0
Pull xfs fixlets from Darrick Wong: "Remove a couple of unnecessary local variables" * tag 'xfs-4.21-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: xfs_fsops: drop useless LIST_HEAD xfs: xfs_buf: drop useless LIST_HEAD
2019-01-05Merge tag 'ceph-for-4.21-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds5-128/+153
Pull ceph updates from Ilya Dryomov: "A fairly quiet round: a couple of messenger performance improvements from myself and a few cap handling fixes from Zheng" * tag 'ceph-for-4.21-rc1' of git://github.com/ceph/ceph-client: ceph: don't encode inode pathes into reconnect message ceph: update wanted caps after resuming stale session ceph: skip updating 'wanted' caps if caps are already issued ceph: don't request excl caps when mount is readonly ceph: don't update importing cap's mseq when handing cap export libceph: switch more to bool in ceph_tcp_sendmsg() libceph: use MSG_SENDPAGE_NOTLAST with ceph_tcp_sendpage() libceph: use sock_no_sendpage() as a fallback in ceph_tcp_sendpage() libceph: drop last_piece logic from write_partial_message_data() ceph: remove redundant assignment ceph: cleanup splice_dentry()
2019-01-05Merge branch 'mount.part1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds8-174/+166
Pull vfs mount API prep from Al Viro: "Mount API prereqs. Mostly that's LSM mount options cleanups. There are several minor fixes in there, but nothing earth-shattering (leaks on failure exits, mostly)" * 'mount.part1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (27 commits) mount_fs: suppress MAC on MS_SUBMOUNT as well as MS_KERNMOUNT smack: rewrite smack_sb_eat_lsm_opts() smack: get rid of match_token() smack: take the guts of smack_parse_opts_str() into a new helper LSM: new method: ->sb_add_mnt_opt() selinux: rewrite selinux_sb_eat_lsm_opts() selinux: regularize Opt_... names a bit selinux: switch away from match_token() selinux: new helper - selinux_add_opt() LSM: bury struct security_mnt_opts smack: switch to private smack_mnt_opts selinux: switch to private struct selinux_mnt_opts LSM: hide struct security_mnt_opts from any generic code selinux: kill selinux_sb_get_mnt_opts() LSM: turn sb_eat_lsm_opts() into a method nfs_remount(): don't leak, don't ignore LSM options quietly btrfs: sanitize security_mnt_opts use selinux; don't open-code a loop in sb_finish_set_opts() LSM: split ->sb_set_mnt_opts() out of ->sb_kern_mount() new helper: security_sb_eat_lsm_opts() ...
2019-01-05Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2-2/+2
Pull trivial vfs updates from Al Viro: "A few cleanups + Neil's namespace_unlock() optimization" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: exec: make prepare_bprm_creds static genheaders: %-<width>s had been there since v6; %-*s - since v7 VFS: use synchronize_rcu_expedited() in namespace_unlock() iov_iter: reduce code duplication
2019-01-05Merge branch 'akpm' (patches from Andrew)Linus Torvalds36-313/+399
Merge more updates from Andrew Morton: - procfs updates - various misc bits - lib/ updates - epoll updates - autofs - fatfs - a few more MM bits * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (58 commits) mm/page_io.c: fix polled swap page in checkpatch: add Co-developed-by to signature tags docs: fix Co-Developed-by docs drivers/base/platform.c: kmemleak ignore a known leak fs: don't open code lru_to_page() fs/: remove caller signal_pending branch predictions mm/: remove caller signal_pending branch predictions arch/arc/mm/fault.c: remove caller signal_pending_branch predictions kernel/sched/: remove caller signal_pending branch predictions kernel/locking/mutex.c: remove caller signal_pending branch predictions mm: select HAVE_MOVE_PMD on x86 for faster mremap mm: speed up mremap by 20x on large regions mm: treewide: remove unused address argument from pte_alloc functions initramfs: cleanup incomplete rootfs scripts/gdb: fix lx-version string output kernel/kcov.c: mark write_comp_data() as notrace kernel/sysctl: add panic_print into sysctl panic: add options to print system info when panic happens bfs: extra sanity checking and static inode bitmap exec: separate MM_ANONPAGES and RLIMIT_STACK accounting ...
2019-01-04fs: don't open code lru_to_page()Nikolay Borisov7-11/+12
Multiple filesystems open code lru_to_page(). Rectify this by moving the macro from mm_inline (which is specific to lru stuff) to the more generic mm.h header and start using the macro where appropriate. No functional changes. Link: http://lkml.kernel.org/r/20181129104810.23361-1-nborisov@suse.com Link: https://lkml.kernel.org/r/20181129075301.29087-1-nborisov@suse.com Signed-off-by: Nikolay Borisov <nborisov@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Acked-by: "Yan, Zheng" <zyan@redhat.com> [ceph] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/: remove caller signal_pending branch predictionsDavidlohr Bueso5-6/+6
This is already done for us internally by the signal machinery. [akpm@linux-foundation.org: fix fs/buffer.c] Link: http://lkml.kernel.org/r/20181116002713.8474-7-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04bfs: extra sanity checking and static inode bitmapTigran Aivazian4-42/+40
Strengthen validation of BFS superblock against corruption. Make in-core inode bitmap static part of superblock info structure. Print a warning when mounting a BFS filesystem created with "-N 512" option as only 510 files can be created in the root directory. Make the kernel messages more uniform. Update the 'prefix' passed to bfs_dump_imap() to match the current naming of operations. White space and comments cleanup. Link: http://lkml.kernel.org/r/CAK+_RLkFZMduoQF36wZFd3zLi-6ZutWKsydjeHFNdtRvZZEb4w@mail.gmail.com Signed-off-by: Tigran Aivazian <aivazian.tigran@gmail.com> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04exec: separate MM_ANONPAGES and RLIMIT_STACK accountingOleg Nesterov1-53/+52
get_arg_page() checks bprm->rlim_stack.rlim_cur and re-calculates the "extra" size for argv/envp pointers every time, this is a bit ugly and even not strictly correct: acct_arg_size() must not account this size. Remove all the rlimit code in get_arg_page(). Instead, add bprm->argmin calculated once at the start of __do_execve_file() and change copy_strings to check bprm->p >= bprm->argmin. The patch adds the new helper, prepare_arg_pages() which initializes bprm->argc/envc and bprm->argmin. [oleg@redhat.com: fix !CONFIG_MMU version of get_arg_page()] Link: http://lkml.kernel.org/r/20181126122307.GA1660@redhat.com [akpm@linux-foundation.org: use max_t] Link: http://lkml.kernel.org/r/20181112160910.GA28440@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04exec: load_script: don't blindly truncate shebang stringOleg Nesterov1-3/+7
load_script() simply truncates bprm->buf and this is very wrong if the length of shebang string exceeds BINPRM_BUF_SIZE-2. This can silently truncate i_arg or (worse) we can execute the wrong binary if buf[2:126] happens to be the valid executable path. Change load_script() to return ENOEXEC if it can't find '\n' or zero in bprm->buf. Note that '\0' can come from either prepare_binprm()->memset() or from kernel_read(), we do not care. Link: http://lkml.kernel.org/r/20181112160931.GA28463@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ben Woodard <woodard@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fat: new inline functions to determine the FAT variant (32, 16 or 12)Carmeli Tamir6-22/+39
This patch introduces 3 new inline functions - is_fat12, is_fat16 and is_fat32, and replaces every occurrence in the code in which the FS variant (whether this is FAT12, FAT16 or FAT32) was previously checked using msdos_sb_info->fat_bits. Link: http://lkml.kernel.org/r/1544990640-11604-4-git-send-email-carmeli.tamir@gmail.com Signed-off-by: Carmeli Tamir <carmeli.tamir@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fat: move MAX_FAT to fat.h and change it to inline functionCarmeli Tamir2-1/+10
MAX_FAT is useless in msdos_fs.h, since it uses the MSDOS_SB function that is defined in fat.h. So really, this macro can be only called from code that already includes fat.h. Hence, this patch moves it to fat.h, right after MSDOS_SB is defined. I also changed it to an inline function in order to save the double call to MSDOS_SB. This was suggested by joe@perches.com in the previous version. This patch is required for the next in the series, in which the variant (whether this is FAT12, FAT16 or FAT32) checks are replaced with new macros. Link: http://lkml.kernel.org/r/1544990640-11604-3-git-send-email-carmeli.tamir@gmail.com Signed-off-by: Carmeli Tamir <carmeli.tamir@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fat: remove FAT_FIRST_ENT macroCarmeli Tamir1-4/+8
The comment edited in this patch was the only reference to the FAT_FIRST_ENT macro, which is not used anymore. Moreover, the commented line of code does not compile with the current code. Since the FAT_FIRST_ENT macro checks the FAT variant in a way that the patch series changes, I removed it, and instead wrote a clear explanation of what was checked. I verified that the changed comment is correct according to Microsoft FAT spec, search for "BPB_Media" in the following references: 1. Microsoft FAT specification 2005 (http://read.pudn.com/downloads77/ebook/294884/FAT32%20Spec%20%28SDA%20Contribution%29.pdf). Search for 'volume label'. 2. Microsoft Extensible Firmware Initiative, FAT32 File System Specification (https://staff.washington.edu/dittrich/misc/fatgen103.pdf). Search for 'volume label'. Link: http://lkml.kernel.org/r/1544990640-11604-2-git-send-email-carmeli.tamir@gmail.com Signed-off-by: Carmeli Tamir <carmeli.tamir@gmail.com> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04hfsplus: return file attributes on statxErnesto A. Fernández3-0/+24
The immutable, append-only and no-dump attributes can only be retrieved with an ioctl; implement the ->getattr() method to return them on statx. Do not return the inode birthtime yet, because the issue of how best to handle the post-2038 timestamps is still under discussion. This patch is needed to pass xfstests generic/424. Link: http://lkml.kernel.org/r/20181014163558.sxorxlzjqccq2lpw@eaf Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com> Cc: Viacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04autofs: add strictexpire mount optionIan Kent3-2/+12
Commit 092a53452bb7 ("autofs: take more care to not update last_used on path walk") helped to (partially) resolve a problem where automounts were not expiring due to aggressive accesses from user space. This patch was later reverted because, for very large environments, it meant more mount requests from clients and when there are a lot of clients this caused a fairly significant increase in server load. But there is a need for both types of expire check, depending on use case, so add a mount option to allow for strict update of last use of autofs dentrys (which just means not updating the last use on path walk access). Link: http://lkml.kernel.org/r/154296973880.9889.14085372741514507967.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04autofs: change catatonic setting to a bit flagIan Kent5-16/+20
Change the superblock info. catatonic setting to be part of a flags bit field. Link: http://lkml.kernel.org/r/154296973142.9889.17275721668508589639.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04autofs: simplify parse_options() function callIan Kent1-26/+29
The parse_options() function uses a long list of parameters, most of which are present in the super block info structure already. The mount parameters set in parse_options() options don't require cleanup so using the super block info struct directly is simpler. Link: http://lkml.kernel.org/r/154296972423.9889.9368859245676473329.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04autofs: improve ioctl sbi checksIan Kent3-21/+9
Al Viro made some suggestions to improve the implementation of commit 0633da48f0 ("fix autofs_sbi() does not check super block type"). The check is unnecessary in all cases except for ioctl usage so placing the check in the super block accessor function adds a small overhead to the common case where it isn't needed. So it's sufficient to do this in the ioctl code only. Also the check in the ioctl code is needlessly complex. [akpm@linux-foundation.org: declare autofs_fs_type in .h, not .c] Link: http://lkml.kernel.org/r/154296970987.9889.1597442413573683096.stgit@pluto-themaw-net Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: deal with wait_queue only onceDavidlohr Bueso1-11/+18
There is no reason why we rearm the waitiqueue upon every fetch_events retry (for when events are found yet send_events() fails). If nothing else, this saves four lock operations per retry, and furthermore reduces the scope of the lock even further. [akpm@linux-foundation.org: restore code to original position, fix and reflow comment] Link: http://lkml.kernel.org/r/20181114182532.27981-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: rename check_events label to send_eventsDavidlohr Bueso1-3/+3
It is currently called check_events because it, well, did exactly that. However, since the lockless ep_events_available() call, the label no longer checks, but just sends the events. Rename as such. Link: http://lkml.kernel.org/r/20181114182532.27981-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: avoid barrier after an epoll_wait(2) timeoutDavidlohr Bueso1-2/+6
Upon timeout, we can just exit out of the loop, without the cost of the changing the task's state with an smp_store_mb call. Just exit out of the loop and be done - setting the task state afterwards will be, of course, redundant. [dave@stgolabs.net: forgotten fixlets] Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5 Link: http://lkml.kernel.org/r/20181108051006.18751-7-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: reduce the scope of wq lock in epoll_wait()Davidlohr Bueso1-54/+60
This patch aims at reducing ep wq.lock hold times in epoll_wait(2). For the blocking case, there is no need to constantly take and drop the spinlock, which is only needed to manipulate the waitqueue. The call to ep_events_available() is now lockless, and only exposed to benign races. Here, if false positive (returns available events and does not see another thread deleting an epi from the list) we call into send_events and then the list's state is correctly seen. Otoh, if a false negative and we don't see a list_add_tail(), for example, from irq callback, then it is rechecked again before blocking, which will see the correct state. In order for more accuracy to see concurrent list_del_init(), use the list_empty_careful() variant -- of course, this won't be safe against insertions from wakeup. For the overflow list we obviously need to prevent load/store tearing as we don't want to see partial values while the ready list is disabled. [dave@stgolabs.net: forgotten fixlets] Link: http://lkml.kernel.org/r/20181109155258.jxcr4t2pnz6zqct3@linux-r8p5 Link: http://lkml.kernel.org/r/20181108051006.18751-6-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Suggested-by: Jason Baron <jbaron@akamai.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: robustify ep->mtx held checksDavidlohr Bueso1-0/+2
Insted of just commenting how important it is, lets make it more robust and add a lockdep_assert_held() call. Link: http://lkml.kernel.org/r/20181108051006.18751-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: drop ovflist branch predictionDavidlohr Bueso1-1/+1
The ep->ovflist is a secondary ready-list to temporarily store events that might occur when doing sproc without holding the ep->wq.lock. This accounts for every time we check for ready events and also send events back to userspace; both callbacks, particularly the latter because of copy_to_user, can account for a non-trivial time. As such, the unlikely() check to see if the pointer is being used, seems both misleading and sub-optimal. In fact, we go to an awful lot of trouble to sync both lists, and populating the ovflist is far from an uncommon scenario. For example, profiling a concurrent epoll_wait(2) benchmark, with CONFIG_PROFILE_ANNOTATED_BRANCHES shows that for a two threads a 33% incorrect rate was seen; and when incrementally increasing the number of epoll instances (which is used, for example for multiple queuing load balancing models), up to a 90% incorrect rate was seen. Similarly, by deleting the prediction, 3% throughput boost was seen across incremental threads. Link: http://lkml.kernel.org/r/20181108051006.18751-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: simplify ep_send_events_proc() ready-list loopDavidlohr Bueso1-36/+37
The current logic is a bit convoluted. Lets simplify this with a standard list_for_each_entry_safe() loop instead and just break out after maxevents is reached. While at it, remove an unnecessary indentation level in the loop when there are in fact ready events. Link: http://lkml.kernel.org/r/20181108051006.18751-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-04fs/epoll: remove max_nests argument from ep_call_nested()Davidlohr Bueso1-8/+6
Patch series "epoll: some miscellaneous optimizations". The following are some incremental optimizations on some of the epoll core. Each patch has the details, but together, the series is seen to shave off measurable cycles on a number of systems and workloads. For example, on a 40-core IB, a pipetest as well as parallel epoll_wait() benchmark show around a 20-30% increase in raw operations per second when the box is fully occupied (incremental thread counts), and up to 15% performance improvement with lower counts. Passes ltp epoll related testcases. This patch(of 6): All callers pass the EP_MAX_NESTS constant already, so lets simplify this a tad and get rid of the redundant parameter for nested eventpolls. Link: http://lkml.kernel.org/r/20181108051006.18751-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>