aboutsummaryrefslogtreecommitdiffstats
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2020-06-16btrfs: fix data block group relocation failure due to concurrent scrubFilipe Manana1-1/+18
When running relocation of a data block group while scrub is running in parallel, it is possible that the relocation will fail and abort the current transaction with an -EINVAL error: [134243.988595] BTRFS info (device sdc): found 14 extents, stage: move data extents [134243.999871] ------------[ cut here ]------------ [134244.000741] BTRFS: Transaction aborted (error -22) [134244.001692] WARNING: CPU: 0 PID: 26954 at fs/btrfs/ctree.c:1071 __btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.003380] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) [134244.012577] CPU: 0 PID: 26954 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 [134244.014162] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [134244.016184] RIP: 0010:__btrfs_cow_block+0x6a7/0x790 [btrfs] [134244.017151] Code: 48 c7 c7 (...) [134244.020549] RSP: 0018:ffffa41607863888 EFLAGS: 00010286 [134244.021515] RAX: 0000000000000000 RBX: ffff9614bdfe09c8 RCX: 0000000000000000 [134244.022822] RDX: 0000000000000001 RSI: ffffffffb3d63980 RDI: 0000000000000001 [134244.024124] RBP: ffff961589e8c000 R08: 0000000000000000 R09: 0000000000000001 [134244.025424] R10: ffffffffc0ae5955 R11: 0000000000000000 R12: ffff9614bd530d08 [134244.026725] R13: ffff9614ced41b88 R14: ffff9614bdfe2a48 R15: 0000000000000000 [134244.028024] FS: 00007f29b63c08c0(0000) GS:ffff9615ba600000(0000) knlGS:0000000000000000 [134244.029491] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [134244.030560] CR2: 00007f4eb339b000 CR3: 0000000130d6e006 CR4: 00000000003606f0 [134244.031997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [134244.033153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [134244.034484] Call Trace: [134244.034984] btrfs_cow_block+0x12b/0x2b0 [btrfs] [134244.035859] do_relocation+0x30b/0x790 [btrfs] [134244.036681] ? do_raw_spin_unlock+0x49/0xc0 [134244.037460] ? _raw_spin_unlock+0x29/0x40 [134244.038235] relocate_tree_blocks+0x37b/0x730 [btrfs] [134244.039245] relocate_block_group+0x388/0x770 [btrfs] [134244.040228] btrfs_relocate_block_group+0x161/0x2e0 [btrfs] [134244.041323] btrfs_relocate_chunk+0x36/0x110 [btrfs] [134244.041345] btrfs_balance+0xc06/0x1860 [btrfs] [134244.043382] ? btrfs_ioctl_balance+0x27c/0x310 [btrfs] [134244.045586] btrfs_ioctl_balance+0x1ed/0x310 [btrfs] [134244.045611] btrfs_ioctl+0x1880/0x3760 [btrfs] [134244.049043] ? do_raw_spin_unlock+0x49/0xc0 [134244.049838] ? _raw_spin_unlock+0x29/0x40 [134244.050587] ? __handle_mm_fault+0x11b3/0x14b0 [134244.051417] ? ksys_ioctl+0x92/0xb0 [134244.052070] ksys_ioctl+0x92/0xb0 [134244.052701] ? trace_hardirqs_off_thunk+0x1a/0x1c [134244.053511] __x64_sys_ioctl+0x16/0x20 [134244.054206] do_syscall_64+0x5c/0x280 [134244.054891] entry_SYSCALL_64_after_hwframe+0x49/0xbe [134244.055819] RIP: 0033:0x7f29b51c9dd7 [134244.056491] Code: 00 00 00 (...) [134244.059767] RSP: 002b:00007ffcccc1dd08 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [134244.061168] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f29b51c9dd7 [134244.062474] RDX: 00007ffcccc1dda0 RSI: 00000000c4009420 RDI: 0000000000000003 [134244.063771] RBP: 0000000000000003 R08: 00005565cea4b000 R09: 0000000000000000 [134244.065032] R10: 0000000000000541 R11: 0000000000000202 R12: 00007ffcccc2060a [134244.066327] R13: 00007ffcccc1dda0 R14: 0000000000000002 R15: 00007ffcccc1dec0 [134244.067626] irq event stamp: 0 [134244.068202] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [134244.069351] hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.070909] softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 [134244.072392] softirqs last disabled at (0): [<0000000000000000>] 0x0 [134244.073432] ---[ end trace bd7c03622e0b0a99 ]--- The -EINVAL error comes from the following chain of function calls: __btrfs_cow_block() <-- aborts the transaction btrfs_reloc_cow_block() replace_file_extents() get_new_location() <-- returns -EINVAL When relocating a data block group, for each allocated extent of the block group, we preallocate another extent (at prealloc_file_extent_cluster()), associated with the data relocation inode, and then dirty all its pages. These preallocated extents have, and must have, the same size that extents from the data block group being relocated have. Later before we start the relocation stage that updates pointers (bytenr field of file extent items) to point to the the new extents, we trigger writeback for the data relocation inode. The expectation is that writeback will write the pages to the previously preallocated extents, that it follows the NOCOW path. That is generally the case, however, if a scrub is running it may have turned the block group that contains those extents into RO mode, in which case writeback falls back to the COW path. However in the COW path instead of allocating exactly one extent with the expected size, the allocator may end up allocating several smaller extents due to free space fragmentation - because we tell it at cow_file_range() that the minimum allocation size can match the filesystem's sector size. This later breaks the relocation's expectation that an extent associated to a file extent item in the data relocation inode has the same size as the respective extent pointed by a file extent item in another tree - in this case the extent to which the relocation inode poins to is smaller, causing relocation.c:get_new_location() to return -EINVAL. For example, if we are relocating a data block group X that has a logical address of X and the block group has an extent allocated at the logical address X + 128KiB with a size of 64KiB: 1) At prealloc_file_extent_cluster() we allocate an extent for the data relocation inode with a size of 64KiB and associate it to the file offset 128KiB (X + 128KiB - X) of the data relocation inode. This preallocated extent was allocated at block group Z; 2) A scrub running in parallel turns block group Z into RO mode and starts scrubing its extents; 3) Relocation triggers writeback for the data relocation inode; 4) When running delalloc (btrfs_run_delalloc_range()), we try first the NOCOW path because the data relocation inode has BTRFS_INODE_PREALLOC set in its flags. However, because block group Z is in RO mode, the NOCOW path (run_delalloc_nocow()) falls back into the COW path, by calling cow_file_range(); 5) At cow_file_range(), in the first iteration of the while loop we call btrfs_reserve_extent() to allocate a 64KiB extent and pass it a minimum allocation size of 4KiB (fs_info->sectorsize). Due to free space fragmentation, btrfs_reserve_extent() ends up allocating two extents of 32KiB each, each one on a different iteration of that while loop; 6) Writeback of the data relocation inode completes; 7) Relocation proceeds and ends up at relocation.c:replace_file_extents(), with a leaf which has a file extent item that points to the data extent from block group X, that has a logical address (bytenr) of X + 128KiB and a size of 64KiB. Then it calls get_new_location(), which does a lookup in the data relocation tree for a file extent item starting at offset 128KiB (X + 128KiB - X) and belonging to the data relocation inode. It finds a corresponding file extent item, however that item points to an extent that has a size of 32KiB, which doesn't match the expected size of 64KiB, resuling in -EINVAL being returned from this function and propagated up to __btrfs_cow_block(), which aborts the current transaction. To fix this make sure that at cow_file_range() when we call the allocator we pass it a minimum allocation size corresponding the desired extent size if the inode belongs to the data relocation tree, otherwise pass it the filesystem's sector size as the minimum allocation size. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix race between block group removal and block group creationFilipe Manana1-8/+19
There is a race between block group removal and block group creation when the removal is completed by a task running fitrim or scrub. When this happens we end up failing the block group creation with an error -EEXIST since we attempt to insert a duplicate block group item key in the extent tree. That results in a transaction abort. The race happens like this: 1) Task A is doing a fitrim, and at btrfs_trim_block_group() it freezes block group X with btrfs_freeze_block_group() (until very recently that was named btrfs_get_block_group_trimming()); 2) Task B starts removing block group X, either because it's now unused or due to relocation for example. So at btrfs_remove_block_group(), while holding the chunk mutex and the block group's lock, it sets the 'removed' flag of the block group and it sets the local variable 'remove_em' to false, because the block group is currently frozen (its 'frozen' counter is > 0, until very recently this counter was named 'trimming'); 3) Task B unlocks the block group and the chunk mutex; 4) Task A is done trimming the block group and unfreezes the block group by calling btrfs_unfreeze_block_group() (until very recently this was named btrfs_put_block_group_trimming()). In this function we lock the block group and set the local variable 'cleanup' to true because we were able to decrement the block group's 'frozen' counter down to 0 and the flag 'removed' is set in the block group. Since 'cleanup' is set to true, it locks the chunk mutex and removes the extent mapping representing the block group from the mapping tree; 5) Task C allocates a new block group Y and it picks up the logical address that block group X had as the logical address for Y, because X was the block group with the highest logical address and now the second block group with the highest logical address, the last in the fs mapping tree, ends at an offset corresponding to block group X's logical address (this logical address selection is done at volumes.c:find_next_chunk()). At this point the new block group Y does not have yet its item added to the extent tree (nor the corresponding device extent items and chunk item in the device and chunk trees). The new group Y is added to the list of pending block groups in the transaction handle; 6) Before task B proceeds to removing the block group item for block group X from the extent tree, which has a key matching: (X logical offset, BTRFS_BLOCK_GROUP_ITEM_KEY, length) task C while ending its transaction handle calls btrfs_create_pending_block_groups(), which finds block group Y and tries to insert the block group item for Y into the exten tree, which fails with -EEXIST since logical offset is the same that X had and task B hasn't yet deleted the key from the extent tree. This failure results in a transaction abort, producing a stack like the following: ------------[ cut here ]------------ BTRFS: Transaction aborted (error -17) WARNING: CPU: 2 PID: 19736 at fs/btrfs/block-group.c:2074 btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Modules linked in: btrfs blake2b_generic xor raid6_pq (...) CPU: 2 PID: 19736 Comm: fsstress Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_create_pending_block_groups+0x1eb/0x260 [btrfs] Code: ff ff ff 48 8b 55 50 f0 48 (...) RSP: 0018:ffffa4160a1c7d58 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff961581909d98 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffb3d63990 RDI: 0000000000000001 RBP: ffff9614f3356a58 R08: 0000000000000000 R09: 0000000000000001 R10: ffff9615b65b0040 R11: 0000000000000000 R12: ffff961581909c10 R13: ffff9615b0c32000 R14: ffff9614f3356ab0 R15: ffff9614be779000 FS: 00007f2ce2841e80(0000) GS:ffff9615bae00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000555f18780000 CR3: 0000000131d34005 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_start_dirty_block_groups+0x398/0x4e0 [btrfs] btrfs_commit_transaction+0xd0/0xc50 [btrfs] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 iterate_supers+0xdb/0x180 ksys_sync+0x60/0xb0 __ia32_sys_sync+0xa/0x10 do_syscall_64+0x5c/0x280 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f2ce1d4d5b7 Code: 83 c4 08 48 3d 01 (...) RSP: 002b:00007ffd8b558c58 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2 RAX: ffffffffffffffda RBX: 000000000000002c RCX: 00007f2ce1d4d5b7 RDX: 00000000ffffffff RSI: 00000000186ba07b RDI: 000000000000002c RBP: 0000555f17b9e520 R08: 0000000000000012 R09: 000000000000ce00 R10: 0000000000000078 R11: 0000000000000202 R12: 0000000000000032 R13: 0000000051eb851f R14: 00007ffd8b558cd0 R15: 0000555f1798ec20 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last enabled at (0): [<ffffffffb2abdedf>] copy_process+0x74f/0x2020 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace bd7c03622e0b0a9c ]--- Fix this simply by making btrfs_remove_block_group() remove the block group's item from the extent tree before it flags the block group as removed. Also make the free space deletion from the free space tree before flagging the block group as removed, to avoid a similar race with adding and removing free space entries for the free space tree. Fixes: 04216820fe83d5 ("Btrfs: fix race between fs trimming and block group remove/allocation") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16btrfs: fix a block group ref counter leak after failure to remove block groupFilipe Manana1-10/+9
When removing a block group, if we fail to delete the block group's item from the extent tree, we jump to the 'out' label and end up decrementing the block group's reference count once only (by 1), resulting in a counter leak because the block group at that point was already removed from the block group cache rbtree - so we have to decrement the reference count twice, once for the rbtree and once for our lookup at the start of the function. There is a second bug where if removing the free space tree entries (the call to remove_block_group_free_space()) fails we end up jumping to the 'out_put_group' label but end up decrementing the reference count only once, when we should have done it twice, since we have already removed the block group from the block group cache rbtree. This happens because the reference count decrement for the rbtree reference happens after attempting to remove the free space tree entries, which is far away from the place where we remove the block group from the rbtree. To make things less error prone, decrement the reference count for the rbtree immediately after removing the block group from it. This also eleminates the need for two different exit labels on error, renaming 'out_put_label' to just 'out' and removing the old 'out'. Fixes: f6033c5e333238 ("btrfs: fix block group leak when removing fails") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-16block: Fix use-after-free in blkdev_get()Jason Yan1-5/+7
In blkdev_get() we call __blkdev_get() to do some internal jobs and if there is some errors in __blkdev_get(), the bdput() is called which means we have released the refcount of the bdev (actually the refcount of the bdev inode). This means we cannot access bdev after that point. But acctually bdev is still accessed in blkdev_get() after calling __blkdev_get(). This results in use-after-free if the refcount is the last one we released in __blkdev_get(). Let's take a look at the following scenerio: CPU0 CPU1 CPU2 blkdev_open blkdev_open Remove disk bd_acquire blkdev_get __blkdev_get del_gendisk bdev_unhash_inode bd_acquire bdev_get_gendisk bd_forget failed because of unhashed bdput bdput (the last one) bdev_evict_inode access bdev => use after free [ 459.350216] BUG: KASAN: use-after-free in __lock_acquire+0x24c1/0x31b0 [ 459.351190] Read of size 8 at addr ffff88806c815a80 by task syz-executor.0/20132 [ 459.352347] [ 459.352594] CPU: 0 PID: 20132 Comm: syz-executor.0 Not tainted 4.19.90 #2 [ 459.353628] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 [ 459.354947] Call Trace: [ 459.355337] dump_stack+0x111/0x19e [ 459.355879] ? __lock_acquire+0x24c1/0x31b0 [ 459.356523] print_address_description+0x60/0x223 [ 459.357248] ? __lock_acquire+0x24c1/0x31b0 [ 459.357887] kasan_report.cold+0xae/0x2d8 [ 459.358503] __lock_acquire+0x24c1/0x31b0 [ 459.359120] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.359784] ? lockdep_hardirqs_on+0x37b/0x580 [ 459.360465] ? _raw_spin_unlock_irq+0x24/0x40 [ 459.361123] ? finish_task_switch+0x125/0x600 [ 459.361812] ? finish_task_switch+0xee/0x600 [ 459.362471] ? mark_held_locks+0xf0/0xf0 [ 459.363108] ? __schedule+0x96f/0x21d0 [ 459.363716] lock_acquire+0x111/0x320 [ 459.364285] ? blkdev_get+0xce/0xbe0 [ 459.364846] ? blkdev_get+0xce/0xbe0 [ 459.365390] __mutex_lock+0xf9/0x12a0 [ 459.365948] ? blkdev_get+0xce/0xbe0 [ 459.366493] ? bdev_evict_inode+0x1f0/0x1f0 [ 459.367130] ? blkdev_get+0xce/0xbe0 [ 459.367678] ? destroy_inode+0xbc/0x110 [ 459.368261] ? mutex_trylock+0x1a0/0x1a0 [ 459.368867] ? __blkdev_get+0x3e6/0x1280 [ 459.369463] ? bdev_disk_changed+0x1d0/0x1d0 [ 459.370114] ? blkdev_get+0xce/0xbe0 [ 459.370656] blkdev_get+0xce/0xbe0 [ 459.371178] ? find_held_lock+0x2c/0x110 [ 459.371774] ? __blkdev_get+0x1280/0x1280 [ 459.372383] ? lock_downgrade+0x680/0x680 [ 459.373002] ? lock_acquire+0x111/0x320 [ 459.373587] ? bd_acquire+0x21/0x2c0 [ 459.374134] ? do_raw_spin_unlock+0x4f/0x250 [ 459.374780] blkdev_open+0x202/0x290 [ 459.375325] do_dentry_open+0x49e/0x1050 [ 459.375924] ? blkdev_get_by_dev+0x70/0x70 [ 459.376543] ? __x64_sys_fchdir+0x1f0/0x1f0 [ 459.377192] ? inode_permission+0xbe/0x3a0 [ 459.377818] path_openat+0x148c/0x3f50 [ 459.378392] ? kmem_cache_alloc+0xd5/0x280 [ 459.379016] ? entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.379802] ? path_lookupat.isra.0+0x900/0x900 [ 459.380489] ? __lock_is_held+0xad/0x140 [ 459.381093] do_filp_open+0x1a1/0x280 [ 459.381654] ? may_open_dev+0xf0/0xf0 [ 459.382214] ? find_held_lock+0x2c/0x110 [ 459.382816] ? lock_downgrade+0x680/0x680 [ 459.383425] ? __lock_is_held+0xad/0x140 [ 459.384024] ? do_raw_spin_unlock+0x4f/0x250 [ 459.384668] ? _raw_spin_unlock+0x1f/0x30 [ 459.385280] ? __alloc_fd+0x448/0x560 [ 459.385841] do_sys_open+0x3c3/0x500 [ 459.386386] ? filp_open+0x70/0x70 [ 459.386911] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 459.387610] ? trace_hardirqs_off_caller+0x55/0x1c0 [ 459.388342] ? do_syscall_64+0x1a/0x520 [ 459.388930] do_syscall_64+0xc3/0x520 [ 459.389490] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.390248] RIP: 0033:0x416211 [ 459.390720] Code: 75 14 b8 02 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 04 19 00 00 c3 48 83 ec 08 e8 0a fa ff ff 48 89 04 24 b8 02 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 53 fa ff ff 48 89 d0 48 83 c4 08 48 3d 01 [ 459.393483] RSP: 002b:00007fe45dfe9a60 EFLAGS: 00000293 ORIG_RAX: 0000000000000002 [ 459.394610] RAX: ffffffffffffffda RBX: 00007fe45dfea6d4 RCX: 0000000000416211 [ 459.395678] RDX: 00007fe45dfe9b0a RSI: 0000000000000002 RDI: 00007fe45dfe9b00 [ 459.396758] RBP: 000000000076bf20 R08: 0000000000000000 R09: 000000000000000a [ 459.397930] R10: 0000000000000075 R11: 0000000000000293 R12: 00000000ffffffff [ 459.399022] R13: 0000000000000bd9 R14: 00000000004cdb80 R15: 000000000076bf2c [ 459.400168] [ 459.400430] Allocated by task 20132: [ 459.401038] kasan_kmalloc+0xbf/0xe0 [ 459.401652] kmem_cache_alloc+0xd5/0x280 [ 459.402330] bdev_alloc_inode+0x18/0x40 [ 459.402970] alloc_inode+0x5f/0x180 [ 459.403510] iget5_locked+0x57/0xd0 [ 459.404095] bdget+0x94/0x4e0 [ 459.404607] bd_acquire+0xfa/0x2c0 [ 459.405113] blkdev_open+0x110/0x290 [ 459.405702] do_dentry_open+0x49e/0x1050 [ 459.406340] path_openat+0x148c/0x3f50 [ 459.406926] do_filp_open+0x1a1/0x280 [ 459.407471] do_sys_open+0x3c3/0x500 [ 459.408010] do_syscall_64+0xc3/0x520 [ 459.408572] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 459.409415] [ 459.409679] Freed by task 1262: [ 459.410212] __kasan_slab_free+0x129/0x170 [ 459.410919] kmem_cache_free+0xb2/0x2a0 [ 459.411564] rcu_process_callbacks+0xbb2/0x2320 [ 459.412318] __do_softirq+0x225/0x8ac Fix this by delaying bdput() to the end of blkdev_get() which means we have finished accessing bdev. Fixes: 77ea887e433a ("implement in-kernel gendisk events handling") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Jason Yan <yanaijie@huawei.com> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-16afs: afs_vnode_commit_status() doesn't need to check the RPC errorDavid Howells1-3/+1
afs_vnode_commit_status() is only ever called if op->error is 0, so remove the op->error checks from the function. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Fix use of afs_check_for_remote_deletion()David Howells7-18/+32
afs_check_for_remote_deletion() checks to see if error ENOENT is returned by the server in response to an operation and, if so, marks the primary vnode as having been deleted as the FID is no longer valid. However, it's being called from the operation success functions, where no abort has happened - and if an inline abort is recorded, it's handled by afs_vnode_commit_status(). Fix this by actually calling the operation aborted method if provided and having that point to afs_check_for_remote_deletion(). Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Remove afs_operation::abort_codeDavid Howells2-2/+1
Remove afs_operation::abort_code as it's read but never set. Use ac.abort_code instead. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Fix yfs_fs_fetch_status() to honour vnode selectorDavid Howells1-25/+26
Fix yfs_fs_fetch_status() to honour the vnode selector in op->fetch_status.which as does afs_fs_fetch_status() that allows afs_do_lookup() to use this as an alternative to the InlineBulkStatus RPC call if not implemented by the server. This doesn't matter in the current code as YFS servers always implement InlineBulkStatus, but a subsequent will call it on YFS servers too in some circumstances. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Remove yfs_fs_fetch_file_status() as it's not usedDavid Howells2-43/+0
Remove yfs_fs_fetch_file_status() as it's no longer used. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16fs: Do not check if there is a fsnotify watcher on pseudo inodesMel Gorman1-1/+1
The kernel uses internal mounts created by kern_mount() and populated with files with no lookup path by alloc_file_pseudo() for a variety of reasons. An example of such a mount is for anonymous pipes. For pipes, every vfs_write() regardless of filesystem, calls fsnotify_modify() to notify of any changes which incurs a small amount of overhead in fsnotify even when there are no watchers. It can also trigger for reads and readv and writev, it was simply vfs_write() that was noticed first. A patch is pending that reduces, but does not eliminate, the overhead of fsnotify but for files that cannot be looked up via a path, even that small overhead is unnecessary. The user API for all notification subsystems (inotify, fanotify, ...) is based on the pathname and a dirfd and proc entries appear to be the only visible representation of the files. Proc does not have the same pathname as the internal entry and the proc inode is not the same as the internal inode so even if fanotify is used on a file under /proc/XX/fd, no useful events are notified. This patch changes alloc_file_pseudo() to always opt out of fsnotify by setting FMODE_NONOTIFY flag so that no check is made for fsnotify watchers on pseudo files. This should be safe as the underlying helper for the dentry is d_alloc_pseudo() which explicitly states that no lookups are ever performed meaning that fanotify should have nothing useful to attach to. The test motivating this was "perf bench sched messaging --pipe". On a single-socket machine using threads the difference of the patch was as follows. 5.7.0 5.7.0 vanilla nofsnotify-v1r1 Amean 1 1.3837 ( 0.00%) 1.3547 ( 2.10%) Amean 3 3.7360 ( 0.00%) 3.6543 ( 2.19%) Amean 5 5.8130 ( 0.00%) 5.7233 * 1.54%* Amean 7 8.1490 ( 0.00%) 7.9730 * 2.16%* Amean 12 14.6843 ( 0.00%) 14.1820 ( 3.42%) Amean 18 21.8840 ( 0.00%) 21.7460 ( 0.63%) Amean 24 28.8697 ( 0.00%) 29.1680 ( -1.03%) Amean 30 36.0787 ( 0.00%) 35.2640 * 2.26%* Amean 32 38.0527 ( 0.00%) 38.1223 ( -0.18%) The difference is small but in some cases it's outside the noise so while marginal, there is still some small benefit to ignoring fsnotify for files allocated via alloc_file_pseudo() in some cases. Link: https://lore.kernel.org/r/20200615121358.GF3183@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-15Squashfs: Replace zero-length array with flexible-arrayGustavo A. R. Silva1-8/+8
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15jffs2: Replace zero-length array with flexible-arrayGustavo A. R. Silva2-3/+3
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15aio: Replace zero-length array with flexible-arrayGustavo A. R. Silva1-1/+1
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://github.com/KSPP/linux/issues/21 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-06-15Merge tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds14-69/+274
Pull more ext4 updates from Ted Ts'o: "This is the second round of ext4 commits for 5.8 merge window [1]. It includes the per-inode DAX support, which was dependant on the DAX infrastructure which came in via the XFS tree, and a number of regression and bug fixes; most notably the "BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported by syzkaller" [1] The pull request actually came in 15 minutes after I had tagged the rc1 release. Tssk, tssk, late.. - Linus * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers ext4: support xattr gnu.* namespace for the Hurd ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr ext4: avoid utf8_strncasecmp() with unstable name ext4: stop overwrite the errcode in ext4_setup_super ext4: fix partial cluster initialization when splitting extent ext4: avoid race conditions when remounting with options that change dax Documentation/dax: Update DAX enablement for ext4 fs/ext4: Introduce DAX inode flag fs/ext4: Remove jflag variable fs/ext4: Make DAX mount option a tri-state fs/ext4: Only change S_DAX on inode load fs/ext4: Update ext4_should_use_dax() fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS fs/ext4: Disallow verity if inode is DAX fs/ext4: Narrow scope of DAX check in setflags
2020-06-15io_uring: cancel by ->task not pidPavel Begunkov2-11/+6
For an exiting process it tries to cancel all its inflight requests. Use req->task to match such instead of work.pid. We always have req->task set, and it will be valid because we're matching only current exiting task. Also, remove work.pid and everything related, it's useless now. Reported-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io_uring: lazy get taskPavel Begunkov1-8/+22
There will be multiple places where req->task is used, so refcount-pin it lazily with introduced *io_{get,put}_req_task(). We need to always have valid ->task for cancellation reasons, but don't care about pinning it in some cases. That's why it sets req->task in io_req_init() and implements get/put laziness with a flag. This also removes using @current from polling io_arm_poll_handler(), etc., but doesn't change observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io_uring: batch cancel in io_uring_cancel_files()Pavel Begunkov1-0/+13
Instead of waiting for each request one by one, first try to cancel all of them in a batched manner, and then go over inflight_list/etc to reap leftovers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io_uring: cancel all task's requests on exitPavel Begunkov3-17/+12
If a process is going away, io_uring_flush() will cancel only 1 request with a matching pid. Cancel all of them Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io-wq: add an option to cancel all matched reqsPavel Begunkov3-28/+36
This adds support for cancelling all io-wq works matching a predicate. It isn't used yet, so no change in observable behaviour. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15io-wq: reorder cancellation pending -> runningPavel Begunkov1-22/+32
Go all over all pending lists and cancel works there, and only then try to match running requests. No functional changes here, just a preparation for bulk cancellation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15afs: Fix the mapping of the UAEOVERFLOW abort codeDavid Howells1-0/+1
Abort code UAEOVERFLOW is returned when we try and set a time that's out of range, but it's currently mapped to EREMOTEIO by the default case. Fix UAEOVERFLOW to map instead to EOVERFLOW. Found with the generic/258 xfstest. Note that the test is wrong as it assumes that the filesystem will support a pre-UNIX-epoch date. Fixes: 1eda8bab70ca ("afs: Add support for the UAE error table") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix truncation issues and mmap writeback sizeDavid Howells3-5/+35
Fix the following issues: (1) Fix writeback to reduce the size of a store operation to i_size, effectively discarding the extra data. The problem comes when afs_page_mkwrite() records that a page is about to be modified by mmap(). It doesn't know what bits of the page are going to be modified, so it records the whole page as being dirty (this is stored in page->private as start and end offsets). Without this, the marshalling for the store to the server extends the size of the file to the end of the page (in afs_fs_store_data() and yfs_fs_store_data()). (2) Fix setattr to actually truncate the pagecache, thereby clearing the discarded part of a file. (3) Fix setattr to check that the new size is okay and to disable ATTR_SIZE if i_size wouldn't change. (4) Force i_size to be updated as the result of a truncate. (5) Don't truncate if ATTR_SIZE is not set. (6) Call pagecache_isize_extended() if the file was enlarged. Note that truncate_set_size() isn't used because the setting of i_size is done inside afs_vnode_commit_status() under the vnode->cb_lock. Found with the generic/029 and generic/393 xfstests. Fixes: 31143d5d515e ("AFS: implement basic file write support") Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Concoct ctimesDavid Howells4-12/+38
The in-kernel afs filesystem ignores ctime because the AFS fileserver protocol doesn't support ctimes. This, however, causes various xfstests to fail. Work around this by: (1) Setting ctime to attr->ia_ctime in afs_setattr(). (2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings. (3) Setting the ctime from the server mtime when on the target file when creating a hard link to it. (4) Setting the ctime on directories from their revised mtimes when renaming/moving a file. Found by the generic/221 and generic/309 xfstests. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix EOF corruptionDavid Howells1-1/+11
When doing a partial writeback, afs_write_back_from_locked_page() may generate an FS.StoreData RPC request that writes out part of a file when a file has been constructed from pieces by doing seek, write, seek, write, ... as is done by ld. The FS.StoreData RPC is given the current i_size as the file length, but the server basically ignores it unless the data length is 0 (in which case it's just a truncate operation). The revised file length returned in the result of the RPC may then not reflect what we suggested - and this leads to i_size getting moved backwards - which causes issues later. Fix the client to take account of this by ignoring the returned file size unless the data version number jumped unexpectedly - in which case we're going to have to clear the pagecache and reload anyway. This can be observed when doing a kernel build on an AFS mount. The following pair of commands produce the issue: ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \ -T arch/x86/realmode/rm/realmode.lds \ arch/x86/realmode/rm/header.o \ arch/x86/realmode/rm/trampoline_64.o \ arch/x86/realmode/rm/stack.o \ arch/x86/realmode/rm/reboot.o \ -o arch/x86/realmode/rm/realmode.elf arch/x86/tools/relocs --realmode \ arch/x86/realmode/rm/realmode.elf \ >arch/x86/realmode/rm/realmode.relocs This results in the latter giving: Cannot read ELF section headers 0/18: Success as the realmode.elf file got corrupted. The sequence of events can also be driven with: xfs_io -t -f \ -c "pwrite -S 0x58 0 0x58" \ -c "pwrite -S 0x59 10000 1000" \ -c "close" \ /afs/example.com/scratch/a Fixes: 31143d5d515e ("AFS: implement basic file write support") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: afs_write_end() should change i_size under the right lockDavid Howells1-2/+2
Fix afs_write_end() to change i_size under vnode->cb_lock rather than ->wb_lock so that it doesn't race with afs_vnode_commit_status() and afs_getattr(). The ->wb_lock is only meant to guard access to ->wb_keys which isn't accessed by that piece of code. Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix non-setting of mtime when writing into mmapDavid Howells1-0/+1
The mtime on an inode needs to be updated when a write is made into an mmap'ed section. There are three ways in which this could be done: update it when page_mkwrite is called, update it when a page is changed from dirty to writeback or leave it to the server and fix the mtime up from the reply to the StoreData RPC. Found with the generic/215 xfstest. Fixes: 1cf7a1518aef ("afs: Implement shared-writeable mmap") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15io_uring: fix lazy work initPavel Begunkov1-0/+1
Don't leave garbage in req.work before punting async on -EAGAIN in io_iopoll_queue(). [ 140.922099] general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI ... [ 140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480 ... [ 140.922114] Call Trace: [ 140.922118] ? __next_timer_interrupt+0xe0/0xe0 [ 140.922119] io_wqe_worker+0x2a9/0x360 [ 140.922121] ? _raw_spin_unlock_irqrestore+0x24/0x40 [ 140.922124] kthread+0x12c/0x170 [ 140.922125] ? io_worker_handle_work+0x480/0x480 [ 140.922126] ? kthread_park+0x90/0x90 [ 140.922127] ret_from_fork+0x22/0x30 Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-15efivarfs: Don't return -EINTR when rate-limiting readsTony Luck1-4/+2
Applications that read EFI variables may see a return value of -EINTR if they exceed the rate limit and a signal delivery is attempted while the process is sleeping. This is quite surprising to the application, which probably doesn't have code to handle it. Change the interruptible sleep to a non-interruptible one. Reported-by: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200528194905.690-3-tony.luck@intel.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-06-15efivarfs: Update inode modification time for successful writesTony Luck1-0/+1
Some applications want to be able to see when EFI variables have been updated. Update the modification time for successful writes. Reported-by: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20200528194905.690-2-tony.luck@intel.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-06-14Merge tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds6-234/+284
Pull btrfs updates from David Sterba: "This reverts the direct io port to iomap infrastructure of btrfs merged in the first pull request. We found problems in invalidate page that don't seem to be fixable as regressions or without changing iomap code that would not affect other filesystems. There are four reverts in total, but three of them are followup cleanups needed to revert a43a67a2d715 cleanly. The result is the buffer head based implementation of direct io. Reverts are not great, but under current circumstances I don't see better options" * tag 'for-5.8-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Revert "btrfs: switch to iomap_dio_rw() for dio" Revert "fs: remove dio_end_io()" Revert "btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK" Revert "btrfs: split btrfs_direct_IO to read and write part"
2020-06-14Revert "btrfs: switch to iomap_dio_rw() for dio"David Sterba4-166/+169
This reverts commit a43a67a2d715540c1368b9501a22b0373b5874c0. This patch reverts the main part of switching direct io implementation to iomap infrastructure. There's a problem in invalidate page that couldn't be solved as regression in this development cycle. The problem occurs when buffered and direct io are mixed, and the ranges overlap. Although this is not recommended, filesystems implement measures or fallbacks to make it somehow work. In this case, fallback to buffered IO would be an option for btrfs (this already happens when direct io is done on compressed data), but the change would be needed in the iomap code, bringing new semantics to other filesystems. Another problem arises when again the buffered and direct ios are mixed, invalidation fails, then -EIO is set on the mapping and fsync will fail, though there's no real error. There have been discussions how to fix that, but revert seems to be the least intrusive option. Link: https://lore.kernel.org/linux-btrfs/20200528192103.xm45qoxqmkw7i5yl@fiona/ Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-13Merge tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds19-57/+571
Pull more cifs updates from Steve French: "12 cifs/smb3 fixes, 2 for stable. - add support for idsfromsid on create and chgrp/chown allowing ability to save owner information more naturally for some workloads - improve query info (getattr) when SMB3.1.1 posix extensions are negotiated by using new query info level" * tag '5.8-rc-smb3-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: smb3: Add debug message for new file creation with idsfromsid mount option cifs: fix chown and chgrp when idsfromsid mount option enabled smb3: allow uid and gid owners to be set on create with idsfromsid mount option smb311: Add tracepoints for new compound posix query info smb311: add support for using info level for posix extensions query smb311: Add support for lookup with posix extensions query info smb311: Add support for SMB311 query info (non-compounded) SMB311: Add support for query info using posix extensions (level 100) smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctl smb3: fix typo in mount options displayed in /proc/mounts cifs: Add get_security_type_str function to return sec type. smb3: extend fscache mount volume coherency check
2020-06-13Merge tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuildLinus Torvalds7-27/+27
Pull more Kbuild updates from Masahiro Yamada: - fix build rules in binderfs sample - fix build errors when Kbuild recurses to the top Makefile - covert '---help---' in Kconfig to 'help' * tag 'kbuild-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: treewide: replace '---help---' in Kconfig files with 'help' kbuild: fix broken builds because of GZIP,BZIP2,LZOP variables samples: binderfs: really compile this sample and fix build issues
2020-06-13Merge tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+1
Pull iomap fix from Darrick Wong: "A single iomap bug fix for a variable type mistake on 32-bit architectures, fixing an integer overflow problem in the unshare actor" * tag 'iomap-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: Fix unsharing of an extent >2GB on a 32-bit machine
2020-06-13Merge tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-1/+3
Pull xfs fix from Darrick Wong: "We've settled down into the bugfix phase; this one fixes a resource leak on an error bailout path" * tag 'xfs-5.8-merge-9' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Add the missed xfs_perag_put() for xfs_ifree_cluster()
2020-06-14treewide: replace '---help---' in Kconfig files with 'help'Masahiro Yamada7-27/+27
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-13Merge tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fsLinus Torvalds2-76/+178
Pull notification queue from David Howells: "This adds a general notification queue concept and adds an event source for keys/keyrings, such as linking and unlinking keys and changing their attributes. Thanks to Debarshi Ray, we do have a pull request to use this to fix a problem with gnome-online-accounts - as mentioned last time: https://gitlab.gnome.org/GNOME/gnome-online-accounts/merge_requests/47 Without this, g-o-a has to constantly poll a keyring-based kerberos cache to find out if kinit has changed anything. [ There are other notification pending: mount/sb fsinfo notifications for libmount that Karel Zak and Ian Kent have been working on, and Christian Brauner would like to use them in lxc, but let's see how this one works first ] LSM hooks are included: - A set of hooks are provided that allow an LSM to rule on whether or not a watch may be set. Each of these hooks takes a different "watched object" parameter, so they're not really shareable. The LSM should use current's credentials. [Wanted by SELinux & Smack] - A hook is provided to allow an LSM to rule on whether or not a particular message may be posted to a particular queue. This is given the credentials from the event generator (which may be the system) and the watch setter. [Wanted by Smack] I've provided SELinux and Smack with implementations of some of these hooks. WHY === Key/keyring notifications are desirable because if you have your kerberos tickets in a file/directory, your Gnome desktop will monitor that using something like fanotify and tell you if your credentials cache changes. However, we also have the ability to cache your kerberos tickets in the session, user or persistent keyring so that it isn't left around on disk across a reboot or logout. Keyrings, however, cannot currently be monitored asynchronously, so the desktop has to poll for it - not so good on a laptop. This facility will allow the desktop to avoid the need to poll. DESIGN DECISIONS ================ - The notification queue is built on top of a standard pipe. Messages are effectively spliced in. The pipe is opened with a special flag: pipe2(fds, O_NOTIFICATION_PIPE); The special flag has the same value as O_EXCL (which doesn't seem like it will ever be applicable in this context)[?]. It is given up front to make it a lot easier to prohibit splice&co from accessing the pipe. [?] Should this be done some other way? I'd rather not use up a new O_* flag if I can avoid it - should I add a pipe3() system call instead? The pipe is then configured:: ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth); ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter); Messages are then read out of the pipe using read(). - It should be possible to allow write() to insert data into the notification pipes too, but this is currently disabled as the kernel has to be able to insert messages into the pipe *without* holding pipe->mutex and the code to make this work needs careful auditing. - sendfile(), splice() and vmsplice() are disabled on notification pipes because of the pipe->mutex issue and also because they sometimes want to revert what they just did - but one or more notification messages might've been interleaved in the ring. - The kernel inserts messages with the wait queue spinlock held. This means that pipe_read() and pipe_write() have to take the spinlock to update the queue pointers. - Records in the buffer are binary, typed and have a length so that they can be of varying size. This allows multiple heterogeneous sources to share a common buffer; there are 16 million types available, of which I've used just a few, so there is scope for others to be used. Tags may be specified when a watchpoint is created to help distinguish the sources. - Records are filterable as types have up to 256 subtypes that can be individually filtered. Other filtration is also available. - Notification pipes don't interfere with each other; each may be bound to a different set of watches. Any particular notification will be copied to all the queues that are currently watching for it - and only those that are watching for it. - When recording a notification, the kernel will not sleep, but will rather mark a queue as having lost a message if there's insufficient space. read() will fabricate a loss notification message at an appropriate point later. - The notification pipe is created and then watchpoints are attached to it, using one of: keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01); watch_mount(AT_FDCWD, "/", 0, fd, 0x02); watch_sb(AT_FDCWD, "/mnt", 0, fd, 0x03); where in both cases, fd indicates the queue and the number after is a tag between 0 and 255. - Watches are removed if either the notification pipe is destroyed or the watched object is destroyed. In the latter case, a message will be generated indicating the enforced watch removal. Things I want to avoid: - Introducing features that make the core VFS dependent on the network stack or networking namespaces (ie. usage of netlink). - Dumping all this stuff into dmesg and having a daemon that sits there parsing the output and distributing it as this then puts the responsibility for security into userspace and makes handling namespaces tricky. Further, dmesg might not exist or might be inaccessible inside a container. - Letting users see events they shouldn't be able to see. TESTING AND MANPAGES ==================== - The keyutils tree has a pipe-watch branch that has keyctl commands for making use of notifications. Proposed manual pages can also be found on this branch, though a couple of them really need to go to the main manpages repository instead. If the kernel supports the watching of keys, then running "make test" on that branch will cause the testing infrastructure to spawn a monitoring process on the side that monitors a notifications pipe for all the key/keyring changes induced by the tests and they'll all be checked off to make sure they happened. https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log/?h=pipe-watch - A test program is provided (samples/watch_queue/watch_test) that can be used to monitor for keyrings, mount and superblock events. Information on the notifications is simply logged to stdout" * tag 'notifications-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: smack: Implement the watch_key and post_notification hooks selinux: Implement the watch_key security hook keys: Make the KEY_NEED_* perms an enum rather than a mask pipe: Add notification lossage handling pipe: Allow buffers to be marked read-whole-or-error for notifications Add sample notification program watch_queue: Add a key/keyring notification facility security: Add hooks to rule on setting a watch pipe: Add general notification queue support pipe: Add O_NOTIFICATION_PIPE security: Add a hook for the point of notification insertion uapi: General notification queue definitions
2020-06-12smb3: Add debug message for new file creation with idsfromsid mount optionSteve French2-2/+4
Pavel noticed that a debug message (disabled by default) in creating the security descriptor context could be useful for new file creation owner fields (as we already have for the mode) when using mount parm idsfromsid. [38120.392272] CIFS: FYI: owner S-1-5-88-1-0, group S-1-5-88-2-0 [38125.792637] CIFS: FYI: owner S-1-5-88-1-1000, group S-1-5-88-2-1000 Also cleans up a typo in a comment Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespaceLinus Torvalds3-3/+3
Pull proc fix from Eric Biederman: "Much to my surprise syzbot found a very old bug in proc that the recent changes made easier to reproce. This bug is subtle enough it looks like it fooled everyone who should know better" * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Use new_inode not new_inode_pseudo
2020-06-12proc: Use new_inode not new_inode_pseudoEric W. Biederman3-3/+3
Recently syzbot reported that unmounting proc when there is an ongoing inotify watch on the root directory of proc could result in a use after free when the watch is removed after the unmount of proc when the watcher exits. Commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount of proc") made it easier to unmount proc and allowed syzbot to see the problem, but looking at the code it has been around for a long time. Looking at the code the fsnotify watch should have been removed by fsnotify_sb_delete in generic_shutdown_super. Unfortunately the inode was allocated with new_inode_pseudo instead of new_inode so the inode was not on the sb->s_inodes list. Which prevented fsnotify_unmount_inodes from finding the inode and removing the watch as well as made it so the "VFS: Busy inodes after unmount" warning could not find the inodes to warn about them. Make all of the inodes in proc visible to generic_shutdown_super, and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo. The only functional difference is that new_inode places the inodes on the sb->s_inodes list. I wrote a small test program and I can verify that without changes it can trigger this issue, and by replacing new_inode_pseudo with new_inode the issues goes away. Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com Fixes: 0097875bd415 ("proc: Implement /proc/thread-self to point at the directory of the current thread") Fixes: 021ada7dff22 ("procfs: switch /proc/self away from proc_dir_entry") Fixes: 51f0885e5415 ("vfs,proc: guarantee unique inodes in /proc") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-06-12ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlerszhangyi (F)2-16/+17
In the ext4 filesystem with errors=panic, if one process is recording errno in the superblock when invoking jbd2_journal_abort() due to some error cases, it could be raced by another __ext4_abort() which is setting the SB_RDONLY flag but missing panic because errno has not been recorded. jbd2_journal_commit_transaction() jbd2_journal_abort() journal->j_flags |= JBD2_ABORT; jbd2_journal_update_sb_errno() | ext4_journal_check_start() | __ext4_abort() | sb->s_flags |= SB_RDONLY; | if (!JBD2_REC_ERR) | return; journal->j_flags |= JBD2_REC_ERR; Finally, it will no longer trigger panic because the filesystem has already been set read-only. Fix this by introduce j_abort_mutex to make sure journal abort is completed before panic, and remove JBD2_REC_ERR flag. Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200609073540.3810702-1-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-12cifs: fix chown and chgrp when idsfromsid mount option enabledSteve French1-15/+42
idsfromsid was ignored in chown and chgrp causing it to fail when upcalls were not configured for lookup. idsfromsid allows mapping users when setting user or group ownership using "special SID" (reserved for this). Add support for chmod and chgrp when idsfromsid mount option is enabled. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12smb3: allow uid and gid owners to be set on create with idsfromsid mount optionSteve French4-21/+129
Currently idsfromsid mount option allows querying owner information from the special sids used to represent POSIX uids and gids but needed changes to populate the security descriptor context with the owner information when idsfromsid mount option was used. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2020-06-12ext4: support xattr gnu.* namespace for the HurdJan (janneke) Nieuwenhuizen4-1/+56
The Hurd gained[0] support for moving the translator and author fields out of the inode and into the "gnu.*" xattr namespace. In anticipation of that, an xattr INDEX was reserved[1]. The Hurd has now been brought into compliance[2] with that. This patch adds support for reading and writing such attributes from Linux; you can now do something like mkdir -p hurd-root/servers/socket touch hurd-root/servers/socket/1 setfattr --name=gnu.translator --value='"/hurd/pflocal\0"' \ hurd-root/servers/socket/1 getfattr --name=gnu.translator hurd-root/servers/socket/1 # file: 1 gnu.translator="/hurd/pflocal" to setup a pipe translator, which is being used to create[3] a vm-image for the Hurd from GNU Guix. [0] https://summerofcode.withgoogle.com/projects/#5869799859027968 [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3980bd3b406addb327d858aebd19e229ea340b9a [2] https://git.savannah.gnu.org/cgit/hurd/hurd.git/commit/?id=a04c7bf83172faa7cb080fbe3b6c04a8415ca645 [3] https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-hurd-vm Signed-off-by: Jan Nieuwenhuizen <janneke@gnu.org> Link: https://lore.kernel.org/r/20200525193940.878-1-janneke@gnu.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-12smb311: Add tracepoints for new compound posix query infoSteve French2-6/+6
Add dynamic tracepoints for new SMB3.1.1. posix extensions query info level (100) Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12smb311: add support for using info level for posix extensions querySteve French4-4/+15
Adds calls to the newer info level for query info using SMB3.1.1 posix extensions. The remaining two places that call the older query info (non-SMB3.1.1 POSIX) require passing in the fid and can be updated in a later patch. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12smb311: Add support for lookup with posix extensions query infoSteve French1-1/+3
Improve support for lookup when using SMB3.1.1 posix mounts. Use new info level 100 (posix query info) Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12smb311: Add support for SMB311 query info (non-compounded)Steve French2-0/+15
Add worker function for non-compounded SMB3.1.1 POSIX Extensions query info. This is needed for revalidate of root (cached) directory for example. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12SMB311: Add support for query info using posix extensions (level 100)Steve French6-2/+316
Adds support for better query info on dentry revalidation (using the SMB3.1.1 POSIX extensions level 100). Followon patch will add support for translating the UID/GID from the SID and also will add support for using the posix query info on lookup. Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-06-12smb3: add indatalen that can be a non-zero value to calculation of credit charge in smb2 ioctlNamjae Jeon1-1/+3
Some of tests in xfstests failed with cifsd kernel server since commit e80ddeb2f70e. cifsd kernel server validates credit charge from client by calculating it base on max((InputCount + OutputCount) and (MaxInputResponse + MaxOutputResponse)) according to specification. MS-SMB2 specification describe credit charge calculation of smb2 ioctl : If Connection.SupportsMultiCredit is TRUE, the server MUST validate CreditCharge based on the maximum of (InputCount + OutputCount) and (MaxInputResponse + MaxOutputResponse), as specified in section 3.3.5.2.5. If the validation fails, it MUST fail the IOCTL request with STATUS_INVALID_PARAMETER. This patch add indatalen that can be a non-zero value to calculation of credit charge in SMB2_ioctl_init(). Fixes: e80ddeb2f70e ("smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K") Cc: Stable <stable@vger.kernel.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Cc: Steve French <smfrench@gmail.com> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Steve French <stfrench@microsoft.com>